Notebook Overview¶
In this notebook, we’ll demonstrate how to automatically generate, evaluate, and refine question–answer (QA) pairs from two complementary cleantech corpora (media articles and patent filings) to build a robust retrieval-augmented generation (RAG) system.
To achieve this, we employed a series of NLP techniques and evaluation strategies:
Data Selection & Preparation: We curated 50–100 relevant paragraphs from cleantech media and patent datasets, ensuring diversity and relevance to key topics.
QA Pair Generation: Using LLMs, we generated fact-based QA pairs from the selected paragraphs, ensuring close alignment with the source content.
QA Pair Classification: QA pairs were classified into categories (e.g., Factual, Comparative) using rule-based and embedding methods based on semantic similarity to predefined prototypes.
LLM-as-a-Judge Evaluation: Recognizing that not every generated QA pair would be of equal quality, we then evaluated each pair using LLM-based critique agents. These agents assessed three key criteria:
- Groundedness: How clearly can the question be answered based on the given context?
- Relevance: How useful is the question for developing cleantech NLP applications?
- Standalone Clarity: Does the question make sense independently, without needing extra context?
For each criterion, the LLM provided both a numerical score (1 to 5) and detailed feedback, which helped us identify and filter out lower-quality QA pairs.
Filtering & Analysis: QA pairs scoring at least 3 out of 5 on all criteria were retained, and the dataset's quality was visualized for further refinement.
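The filtering rule in the last step can be sketched with pandas. The column names below are illustrative assumptions for this example, not the notebook's actual schema:

```python
import pandas as pd

# Toy judge scores; question IDs and column names are invented for illustration.
qa_scores = pd.DataFrame({
    "question": ["Q1", "Q2", "Q3"],
    "groundedness": [5, 2, 4],
    "relevance": [4, 5, 3],
    "standalone": [3, 4, 2],
})

criteria = ["groundedness", "relevance", "standalone"]
# Retain only QA pairs scoring at least 3 on every criterion.
filtered = qa_scores[(qa_scores[criteria] >= 3).all(axis=1)]
print(filtered["question"].tolist())  # → ['Q1']
```

Vectorizing the threshold with `.all(axis=1)` avoids a per-row Python loop and reads as a direct translation of "at least 3 on all criteria".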
1. Introduction & Learning Goals¶
We started with the goal of creating a comprehensive QA dataset from cleantech texts, extracting 50–100 high-quality paragraphs from the Cleantech Media and Google Patent datasets. These paragraphs cover various cleantech topics, ensuring diverse and relevant QA pairs.
This notebook walks you through generating high-quality QA pairs, emphasizing the importance of data quality and robust evaluation in NLP applications.
2. Data Preparation & Visualization¶
Set the OpenRouter API Key¶
This key is used for the following tasks:
- Generating the QA pairs with Llama 3.2 and DeepSeek R1.
- Evaluating the QA pairs with Llama 3.2 as a judge.
%%writefile .env
api_key = "<your-api-key>"
Writing .env
Installing Dependencies¶
⚠️ Important:
Google Colab comes with a pre-installed NumPy (often a different release than the one our libraries expect) that your notebook’s C-extension packages (like gensim) may have been compiled against. To avoid ABI mismatches, we first uninstall numpy and then install a pinned version together with the other libraries.
After the installations complete, you must restart the notebook’s runtime (Runtime → Restart runtime) so that no old C-extension binaries remain in memory and the newly compiled versions are loaded correctly. After you have restarted the runtime, you can import the libraries below.
!pip uninstall -y numpy
!pip install numpy==1.26.4
Found existing installation: numpy 1.26.4 Uninstalling numpy-1.26.4: Successfully uninstalled numpy-1.26.4
!pip install -q gensim==4.3.3 tiktoken seaborn spacy langchain openai sentence-transformers python-dotenv gdown pandas matplotlib nltk kagglehub
!python -m spacy download en_core_web_sm
Loading the Dependencies¶
import gensim
import gdown
import os
import warnings
from dotenv import load_dotenv
import pandas as pd
import numpy as np
import ast
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import tiktoken
import spacy
from openai import OpenAI
from tqdm.auto import tqdm
from typing import Optional, List, Tuple
import json
import textwrap
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document as LangchainDocument
pd.set_option('display.max_colwidth', 100)
nlp = spacy.load("en_core_web_sm")
encoding = tiktoken.encoding_for_model("gpt-4")
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
warnings.filterwarnings("ignore")
Setting up our LLM¶
load_dotenv()
client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.getenv("api_key"),
)
Loading Datasets with KaggleHub¶
We’ll use the KaggleHub adapter to fetch both our cleantech media CSVs and Google Patent JSON files directly into pandas DataFrames:
Media Data:
- `articles_df` loads `cleantech_media_dataset_v3_2024-10-28.csv` from the `jannalipenkova/cleantech-media-dataset` dataset.
- `human_eval` loads the evaluation file `cleantech_rag_evaluation_data_2024-09-20.csv` (semicolon-delimited, Latin-1 encoding) as our gold-standard QA dataset.
Patent Data:
We iterate through the JSON files in `prakharbhandari20/cleantech-google-patent-dataset`, parsing each as JSON Lines. Then we select `CleanTech_22-24_updated.json` as `patent_df` for all patent-level analysis.
This approach ensures our data is streamed in consistently and ready for preprocessing and QA-pair generation.
import kagglehub
from kagglehub import KaggleDatasetAdapter
file_path_media_dataset_v3_2024_10_28 = "cleantech_media_dataset_v3_2024-10-28.csv"
articles_df = kagglehub.load_dataset(
KaggleDatasetAdapter.PANDAS,
"jannalipenkova/cleantech-media-dataset",
file_path_media_dataset_v3_2024_10_28,
)
file_path_media_dataset_v3_2024_09_20 = "cleantech_rag_evaluation_data_2024-09-20.csv"
human_eval = kagglehub.load_dataset(
KaggleDatasetAdapter.PANDAS,
"jannalipenkova/cleantech-media-dataset",
file_path_media_dataset_v3_2024_09_20,
pandas_kwargs={
"sep": ";",
"engine": "python",
"encoding": "latin-1",
}
)
file_paths = [
'bq-results-20240124-055833-1706076079048.json',
'CleanTech_22-24.json',
'CleanTech_22-24_updated.json'
]
dfs = {}
for file_path in file_paths:
    df = kagglehub.load_dataset(
        KaggleDatasetAdapter.PANDAS,
        "prakharbhandari20/cleantech-google-patent-dataset",
        file_path,  # load each listed file, not a hard-coded one
        pandas_kwargs={
            "lines": True,
        })
    dfs[file_path] = df
patent_df = dfs['CleanTech_22-24_updated.json']
Loading Pregenerated QA & Patent Datasets¶
Generating QA pairs and processing the full patent corpus can be time-consuming. To speed things up, we’ve pre-generated the datasets needed throughout the notebook and bundled them in datasets.zip. Here we simply:
- Extract all files from datasets.zip into our working directory.
- Clean up by removing the zip file, keeping our workspace tidy.
With this step complete, you’ll have immediate access to the ready-to-use QA and patent data for the rest of the notebook.
file_id = "1hT_VCmRUEw_AN4AzRBJHEKSBHnUpD6Oo"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, output="datasets.zip", quiet=False)
'datasets.zip'
import zipfile
with zipfile.ZipFile("datasets.zip", "r") as zip_file:
    zip_file.extractall()
!rm datasets.zip
Previewing the Datasets¶
Let’s take a quick glance at the first five rows of our datasets to verify our load and see the key columns we’ll work with. Since a full EDA and preprocessing workflow (cleaning, tokenization, feature engineering, etc.) is already implemented in the original notebook (https://github.com/LuciferUchiha/Cleantech-RAG/blob/main/cleantech_rag.ipynb), we’ll skip any heavy data exploration here and concentrate on crafting high-quality QA pairs.
articles_df.head(5)
| Unnamed: 0 | title | date | author | content | domain | url | |
|---|---|---|---|---|---|---|---|
| 0 | 93320 | XPeng Delivered ~100,000 Vehicles In 2021 | 2022-01-02 | NaN | ['Chinese automotive startup XPeng has shown one of the most dramatic auto production ramp-ups i... | cleantechnica | https://cleantechnica.com/2022/01/02/xpeng-delivered-100000-vehicles-in-2021/ |
| 1 | 93321 | Green Hydrogen: Drop In Bucket Or Big Splash? | 2022-01-02 | NaN | ['Sinopec has laid plans to build the largest green hydrogen production facility in the world, b... | cleantechnica | https://cleantechnica.com/2022/01/02/its-a-green-hydrogen-drop-in-the-bucket-but-it-could-still-... |
| 2 | 98159 | World’ s largest floating PV plant goes online in China – pv magazine International | 2022-01-03 | NaN | ['Huaneng Power International has switched on a 320 MW floating PV array in China’ s Shandong pr... | pv-magazine | https://www.pv-magazine.com/2022/01/03/worlds-largest-floating-pv-plant-goes-online-in-china/ |
| 3 | 98158 | Iran wants to deploy 10 GW of renewables over the next four years – pv magazine International | 2022-01-03 | NaN | ['According to the Iranian authorities, there are currently more than 80GW of renewable energy p... | pv-magazine | https://www.pv-magazine.com/2022/01/03/iran-wants-to-deploy-10-gw-of-renewables-over-the-next-fo... |
| 4 | 31128 | Eastern Interconnection Power Grid Said ‘ Being Challenged in New Ways’ | 2022-01-03 | NaN | ['Sign in to get the best natural gas news and data. Follow the topics you want and receive the ... | naturalgasintel | https://www.naturalgasintel.com/eastern-interconnection-power-grid-said-being-challenged-in-new-... |
patent_df.head(3)
| publication_number | application_number | country_code | title_localized | abstract_localized | publication_date | inventor | cpc | |
|---|---|---|---|---|---|---|---|---|
| 0 | US-2022239235-A1 | US-202217717397-A | US | [{'text': 'Adaptable DC-AC Inverter Drive System and Operation', 'language': 'en', 'truncated': ... | [{'text': 'Disclosed is an adaptable DC-AC inverter system and its operation. The system include... | 20220728 | [] | [{'code': 'H02M7/5395', 'inventive': True, 'first': False, 'tree': []}, {'code': 'H02J3/32', 'in... |
| 1 | US-2022239251-A1 | US-202217580956-A | US | [{'text': 'System for providing the energy from a single contiguous solar energy structure to at... | [{'text': 'In accordance with an example embodiment, a solar energy system comprises: a solar en... | 20220728 | [] | [{'code': 'H02S40/38', 'inventive': True, 'first': False, 'tree': []}, {'code': 'G01R22/061', 'i... |
| 2 | EP-4033090-A1 | EP-21152924-A | EP | [{'text': 'Verfahren zum steuern einer windenergieanlage', 'language': 'de', 'truncated': False}... | [{'text': 'Verfahren zum Steuern einer Windenergieanlage (100), wobei die Windenergieanlage (100... | 20220727 | [Schaper, Ulf, von Aswege, Enno, Gerke Funcke, Paul] | [{'code': 'F03D7/0276', 'inventive': True, 'first': True, 'tree': []}, {'code': 'F03D7/0296', 'i... |
Text Preprocessing: Tokenization, Lemmatization & (Optional) Stemming¶
Before we can generate QA pairs, we need to normalize and clean our text. Here’s what happens in this step:
- `preprocess_text` function
  - Converts input to lowercase
  - Tokenizes the text into words
  - Removes punctuation and stop words
  - Applies lemmatization (with optional stemming if `apply_stemming=True`)
- Applying to Media Articles
  - Creates `tokens_title` from the article’s `title` and `tokens_content` from the article’s `content`
- Applying to Patent Records
  - Extracts the first localized `title` and `abstract` text (if present)
  - Generates `tokens_title` and `tokens_abstract` columns
By the end of this cell, both articles_df and patent_df have cleaned, tokenized text ready for topic modeling, embedding, and QA‐pair creation.
def preprocess_text(text, apply_stemming=False):
    """
    Preprocess a given text by:
    - Converting to lowercase
    - Tokenizing
    - Removing punctuation and stop words
    - Applying lemmatization (and optionally stemming)
    """
    if not isinstance(text, str):
        return []
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    if apply_stemming:
        tokens = [stemmer.stem(token) for token in tokens]
    return tokens
articles_df['tokens_title'] = articles_df['title'].apply(preprocess_text)
articles_df['tokens_content'] = articles_df['content'].apply(preprocess_text)
patent_df['tokens_title'] = patent_df['title_localized'].apply(
lambda lst: preprocess_text(lst[0]['text']) if isinstance(lst, list) and lst and 'text' in lst[0] else []
)
patent_df['tokens_abstract'] = patent_df['abstract_localized'].apply(
lambda lst: preprocess_text(lst[0]['text']) if isinstance(lst, list) and lst and 'text' in lst[0] else []
)
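As a quick sanity check, the same pipeline can be sketched without any NLTK downloads. The minimal stop-word list and the absence of lemmatization are simplifying assumptions for this illustration only; the notebook's `preprocess_text` uses NLTK's full English stop-word list plus WordNet lemmatization:

```python
import string

# Assumed minimal stop-word list (the real run uses NLTK's full list).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def preprocess_text_sketch(text):
    if not isinstance(text, str):
        return []
    tokens = text.lower().split()                            # naive whitespace tokenizer
    tokens = [t.strip(string.punctuation) for t in tokens]   # drop surrounding punctuation
    return [t for t in tokens if t and t not in STOP_WORDS]  # remove empties and stop words

print(preprocess_text_sketch("The solar panels are generating power!"))
# → ['solar', 'panels', 'generating', 'power']
```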
Comparing Top Tokens in Media vs. Patent Texts¶
To get a quick sense of how the vocabulary differs between our two domains, we can plot the 15 most frequent tokens in media article content alongside those in patent abstracts. This side-by-side comparison will:
- Highlight domain-specific jargon (e.g., “innovation” vs. “apparatus”).
- Suggest candidate words to add to our custom stop-word list.
- Inform choices around LDA topic modeling parameters.
from collections import Counter
media_tokens = [tok for toks in articles_df['tokens_content'] for tok in toks]
patent_tokens = [tok for toks in patent_df['tokens_abstract'] for tok in toks]
media_top = Counter(media_tokens).most_common(15)
patent_top = Counter(patent_tokens).most_common(15)
m_terms, m_counts = zip(*media_top)
p_terms, p_counts = zip(*patent_top)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.barh(m_terms[::-1], m_counts[::-1])
ax1.set_title("Top 15 Tokens in Media Content")
ax1.invert_yaxis()
ax2.barh(p_terms[::-1], p_counts[::-1])
ax2.set_title("Top 15 Tokens in Patent Abstracts")
ax2.invert_yaxis()
plt.tight_layout()
plt.show()
Insights from Token Frequency Comparison¶
The side-by-side bar charts above reveal clear differences in vocabulary emphasis between media articles and patent abstracts:
Media Content
- Dominated by broad, thematic terms like “energy,” “solar,” and “power.”
- A heavier tail of general language tokens (e.g., punctuation artifacts) suggests we may want to extend our stop-word list before topic modeling.
Patent Abstracts
- Still feature “energy” and “solar,” but with a stronger focus on technical components like “device,” “water,” and “plate.”
- Terms such as “arranged,” “connected,” and “body” reflect procedural descriptions typical of patent language.
Takeaways:
When we craft the QA pairs, we should consider leveraging these domain keywords to guide prompts, e.g., asking about “device configurations” in patents or “market trends” in media.
Utility & Visualization Functions¶
In this section, we define a suite of helper functions to inspect and visualize our generated QA pairs:
- `show_questions(title, questions, n)`
  Quickly prints the first `n` questions under a given `title`, making it easy to spot-check question quality or category samples.
- `gpt4_tokenizer(text)`
  Leverages the GPT-4 tokenizer (`tiktoken`) to split any text into model-level tokens, which is useful for estimating cost or prompt length.
- `plot_character_distribution(df)`
  - Plots a histogram (with KDE) of character lengths in the `question` column.
  - Draws vertical lines for the mean (green) and median (red) to highlight overall length trends.
- `plot_word_distribution(df)`
  - Similar to the above, but for word counts (splitting on spaces).
  - Helps you gauge how verbose or concise your questions are.
- `plot_gpt4_token_distribution(df)`
  - Tokenizes all questions via `gpt4_tokenizer`, filters to alphabetic tokens, and plots the top 20 most common tokens.
  - Reveals dominant vocabulary before and after stop-word filtering.
- `plot_non_stopword_token_distribution(df)`
  - Builds on the GPT-4 token plot but further removes spaCy stop-words.
  - Surfaces the most informative, domain-specific tokens used in your questions.
- `printwrap(text, wrap_length=80)`
  A simple text-wrapping utility that formats long strings to a specified line width, improving readability in console outputs.
Use these functions throughout the notebook to audit question lengths, token usage, and sample content. These are essential steps for refining prompt strategies and ensuring your QA pairs are well-balanced.
def show_questions(title: str, questions: list, n: int = 6) -> None:
    """
    Display the first n questions from the given list with a title.

    Parameters:
    - title: A string to display as the category title.
    - questions: A list of questions.
    - n: The maximum number of questions to display (default is 6).
    """
    print(title)
    print("-" * len(title))
    for idx, question in enumerate(questions[:n], start=1):
        print(f"{idx}. {question}")
    print()
def gpt4_tokenizer(text: str):
    """
    Tokenizes the text using the GPT-4 tokenizer from tiktoken.
    Returns a list of token strings.
    """
    token_ids = encoding.encode(text)
    tokens = [encoding.decode([tid]) for tid in token_ids]
    return tokens
def plot_character_distribution(df):
    """
    Plots a histogram (with KDE) of the character lengths in the 'question' column.
    Displays vertical lines for the median and mean character counts.
    """
    char_lengths = df["question"].map(len)
    sns.histplot(char_lengths, kde=True)
    plt.title("Amount of characters in questions")
    plt.xlabel("Amount of characters")
    plt.ylabel("Number of questions")
    median_char_len = char_lengths.median()
    mean_char_len = char_lengths.mean()
    plt.axvline(median_char_len, color='r', linestyle='--', label=f"Median: {median_char_len:.2f}")
    plt.axvline(mean_char_len, color='g', linestyle='--', label=f"Mean: {mean_char_len:.2f}")
    plt.legend()
    plt.show()
def plot_word_distribution(df):
    """
    Plots a histogram (with KDE) of the word counts in the 'question' column.
    Displays vertical lines for the median and mean word counts.
    """
    word_counts = df["question"].map(lambda x: len(x.split()))
    sns.histplot(word_counts, kde=True)
    plt.title("Amount of words in questions")
    plt.xlabel("Amount of words")
    plt.ylabel("Number of questions")
    median_word_count = word_counts.median()
    mean_word_count = word_counts.mean()
    plt.axvline(median_word_count, color='r', linestyle='--', label=f"Median: {median_word_count:.2f}")
    plt.axvline(mean_word_count, color='g', linestyle='--', label=f"Mean: {mean_word_count:.2f}")
    plt.legend()
    plt.show()
def plot_gpt4_token_distribution(df):
    """
    Uses the GPT-4 tokenizer to tokenize the 'question' column,
    filters out non-alphabetic tokens, counts the top 20 most common tokens,
    and displays a horizontal barplot.
    """
    questions = df['question'].tolist()
    all_tokens = [token for question in questions for token in gpt4_tokenizer(question)]
    alpha_tokens = [token.strip().lower() for token in all_tokens if token.strip().isalpha()]
    alpha_token_counts = Counter(alpha_tokens)
    top_tokens = alpha_token_counts.most_common(20)
    if top_tokens:
        tokens, counts = zip(*top_tokens)
        sns.barplot(x=list(counts), y=list(tokens))
        plt.title("Most common alphabetic tokens")
        plt.xlabel("Count")
        plt.ylabel("Token")
        plt.show()
    else:
        print("No tokens found after filtering.")
def plot_non_stopword_token_distribution(df):
    """
    Uses the GPT-4 tokenizer to tokenize the 'question' column,
    filters out non-alphabetic tokens and then filters out stopwords using spaCy.
    It then counts the top 20 most common non-stopword tokens and displays a horizontal barplot.
    """
    questions = df['question'].tolist()
    all_tokens = [token for question in questions for token in gpt4_tokenizer(question)]
    alpha_tokens = [token.strip().lower() for token in all_tokens if token.strip().isalpha()]
    non_stop_tokens = [token for token in alpha_tokens if not nlp.vocab[token].is_stop]
    non_stop_token_counts = Counter(non_stop_tokens)
    top_tokens = non_stop_token_counts.most_common(20)
    if top_tokens:
        tokens, counts = zip(*top_tokens)
        sns.barplot(x=list(counts), y=list(tokens))
        plt.title("Most common non-stopword tokens")
        plt.xlabel("Count")
        plt.ylabel("Token")
        plt.show()
    else:
        print("No non-stop tokens found after filtering.")
def printwrap(text, wrap_length=80):
    wrapped_text = textwrap.fill(text, wrap_length)
    return wrapped_text
Selection of Important Paragraphs¶
To generate robust QA pairs, we first identified key paragraphs in our dataset. We selected paragraphs based on relevance, informativeness, and their potential to yield high-quality questions and answers.
Process Overview:
Domain Filtering: We separated the data into 5 domains: "azocleantech", "solarindustrymag", "cleantechnica", "pv-magazine", and "energyvoice", ensuring relevance to our project.
Manual Selection by Index: After filtering by domain, we manually selected high-quality paragraphs based on content. For example, we selected 7 paragraphs from "azocleantech" and 8 from "solarindustrymag", saving them in a CSV file. These selections cover diverse cleantech topics to ensure varied input for QA generation.
Language Filter: We focused exclusively on English examples for consistency with the tutorial's approach. For more details, refer to the Cleantech RAG Tutorial.
azocleantech_df = articles_df[articles_df['domain']=='azocleantech']
solarindustrymag_df = articles_df[articles_df['domain']=='solarindustrymag']
pv_magazine_df = articles_df[articles_df['domain']=='pv-magazine']
energyvoice_df = articles_df[articles_df['domain']=='energyvoice']
cleantechnica_df = articles_df[articles_df['domain']=='cleantechnica']
Paragraphs for azocleantech domain¶
# 7 paragraphs of azocleantech_df
selected_indices = [17690, 17915, 18349, 17715, 19247, 15788, 2195]
paragraphs_azocleantech_df = azocleantech_df.loc[selected_indices]
paragraphs_azocleantech_df.head(2)
| Unnamed: 0 | title | date | author | content | domain | url | tokens_title | tokens_content | |
|---|---|---|---|---|---|---|---|---|---|
| 17690 | 23302 | Renewable Energy Partnerships: Green Hydrogen Integration | 2024-05-01 | NaN | ['As we move toward a sustainable future, green hydrogen is a significant player in the renewabl... | azocleantech | https://www.azocleantech.com/article.aspx?ArticleID=1823 | [renewable, energy, partnership, green, hydrogen, integration] | ['as, move, toward, sustainable, future, green, hydrogen, significant, player, renewable, energy... |
| 17915 | 23338 | Asahi Kasei Starts Operation of Multi-Module Hydrogen Pilot Plant in Kawasaki | 2024-05-15 | NaN | ['The Japanese technology company Asahi Kasei is further accelerating its hydrogen business acti... | azocleantech | https://www.azocleantech.com/news.aspx?newsID=34871 | [asahi, kasei, start, operation, multi-module, hydrogen, pilot, plant, kawasaki] | ['the, japanese, technology, company, asahi, kasei, accelerating, hydrogen, business, activity, ... |
Paragraphs for solarindustrymag domain¶
# 8 paragraphs of solarindustrymag_df
selected_indices = [14,4307,1513,2216,13545,13382,2344,1578]
paragraphs_solarindustrymag_df = solarindustrymag_df.loc[selected_indices]
paragraphs_solarindustrymag_df.head(2)
| Unnamed: 0 | title | date | author | content | domain | url | tokens_title | tokens_content | |
|---|---|---|---|---|---|---|---|---|---|
| 14 | 63200 | NREL Project Investigates Wind Condition Impacts on Solar Power Structures | 2022-01-04 | NaN | ['High wind loads increase structural design costs of concentrating solar power ( CSP) collector... | solarindustrymag | https://solarindustrymag.com/nrel-project-investigates-wind-condition-impacts-on-solar-power-str... | [nrel, project, investigates, wind, condition, impact, solar, power, structure] | ['high, wind, load, increase, structural, design, cost, concentrating, solar, power, csp, collec... |
| 4307 | 63461 | Energy Vault Deploying 440 MWh Nevada Energy Storage System for NV Energy | 2022-12-12 | NaN | ['NV Energy, Nevada’ s largest public utility, has awarded Energy Vault Holdings Inc. with a pro... | solarindustrymag | https://solarindustrymag.com/energy-vault-deploys-440-mwh-nevada-energy-storage-system-for-nv-en... | [energy, vault, deploying, 440, mwh, nevada, energy, storage, system, nv, energy] | ['nv, energy, nevada, ’, largest, public, utility, awarded, energy, vault, holding, inc., projec... |
Paragraphs for pv-magazine domain¶
pv_magazine_indices=[1311, 1416, 14590, 16840, 17599, 17643, 18321, 18854, 18906, 19194]
paragraphs_pv_magazine_df = pv_magazine_df.loc[pv_magazine_indices]
paragraphs_pv_magazine_df.head(2)
| Unnamed: 0 | title | date | author | content | domain | url | tokens_title | tokens_content | |
|---|---|---|---|---|---|---|---|---|---|
| 1311 | 98356 | South Korea tests photovoltaics on railroad noise barriers – pv magazine International | 2022-04-21 | NaN | ['Land-scarce South Korea is currently hosting a series of initiatives aimed at deploying solar ... | pv-magazine | https://www.pv-magazine.com/2022/04/21/south-korea-tests-photovoltaics-on-railroad-noise-barriers/ | [south, korea, test, photovoltaics, railroad, noise, barrier, –, pv, magazine, international] | ['land-scarce, south, korea, currently, hosting, series, initiative, aimed, deploying, solar, un... |
| 1416 | 98368 | Quebec publishes draft documents for 1.3 GW tender – pv magazine International | 2022-04-27 | NaN | ['The Canadian provincial government’ s Green Economy Plan, launched in November 2020, envisages... | pv-magazine | https://www.pv-magazine.com/2022/04/27/quebec-publishes-draft-documents-for-1-3-gw-tender/ | [quebec, publishes, draft, document, 1.3, gw, tender, –, pv, magazine, international] | ['the, canadian, provincial, government, ’, green, economy, plan, launched, november, 2020, envi... |
Paragraphs for energyvoice domain¶
energyvoice_indices=[870, 2204, 2979, 3255, 4605, 5268, 6320, 12599, 12647, 18538]
paragraphs_energyvoice_df = energyvoice_df.loc[energyvoice_indices]
paragraphs_energyvoice_df.head(2)
| Unnamed: 0 | title | date | author | content | domain | url | tokens_title | tokens_content | |
|---|---|---|---|---|---|---|---|---|---|
| 870 | 42863 | Patrick Harvie: Ukraine invasion doesn't justify North Sea production boost | 2022-03-14 | NaN | ['Scottish Green party co-leader Patrick Harvie has said the war in Ukraine must not be used to ... | energyvoice | https://www.energyvoice.com/oilandgas/europe/395102/north-sea-patrick-harvie/ | [patrick, harvie, ukraine, invasion, n't, justify, north, sea, production, boost] | ['scottish, green, party, co-leader, patrick, harvie, said, war, ukraine, must, used, justify, i... |
| 2204 | 42960 | Port of Nigg’ s new quay ready to play leading role in renewables boom | 2022-07-05 | NaN | ['Port of Nigg has unveiled its new East Quay following a multimillion-pound investment to incre... | energyvoice | https://www.energyvoice.com/renewables-energy-transition/424759/port-of-niggs-new-quay-ready-to-... | [port, nigg, ’, new, quay, ready, play, leading, role, renewables, boom] | ['port, nigg, unveiled, new, east, quay, following, multimillion-pound, investment, increase, ca... |
Paragraphs for cleantechnica domain¶
filtered_indices = cleantechnica_df.index[cleantechnica_df["Unnamed: 0"].isin(
[95438, 95392, 95363, 95261, 95173, 95163, 95133, 95106,
95074, 94927, 94925, 94662, 94668, 94645, 95411]
)].tolist()
paragraphs_cleantechnica_df = cleantechnica_df.loc[filtered_indices]
paragraphs_cleantechnica_df.head(2)
| Unnamed: 0 | title | date | author | content | domain | url | tokens_title | tokens_content | |
|---|---|---|---|---|---|---|---|---|---|
| 7568 | 94645 | Cruise Robotaxis Entering More Cities | 2023-05-17 | NaN | ['Cruise has slowly been expanding its network of robotaxis and their availability in certain ma... | cleantechnica | https://cleantechnica.com/2023/05/17/cruise-robotaxis-entering-more-cities/ | [cruise, robotaxis, entering, city] | ['cruise, slowly, expanding, network, robotaxis, availability, certain, market, though, ’, long,... |
| 7619 | 94668 | Philippine Electric Vehicle Association Receives Award, Attends EV Meetings in Korea | 2023-05-21 | NaN | ['JEJU, SOUTH KOREA — EVAP, the Electric Vehicle Association of the Philippines, recently partic... | cleantechnica | https://cleantechnica.com/2023/05/21/philippine-electric-vehicle-association-receives-awards-att... | [philippine, electric, vehicle, association, receives, award, attends, ev, meeting, korea] | ['jeju, south, korea, —, evap, electric, vehicle, association, philippine, recently, participate... |
Merging Paragraphs from Multiple Media Sources¶
To build a comprehensive corpus for QA generation and topic analysis, we combine the paragraph-level DataFrames from above into one unified DataFrame.
combined_df = pd.concat([paragraphs_azocleantech_df,
paragraphs_solarindustrymag_df,
paragraphs_pv_magazine_df,
paragraphs_energyvoice_df,
paragraphs_cleantechnica_df],
ignore_index=True)
data_folder = Path("./data")
if not data_folder.exists():
    data_folder.mkdir()
combined_df.to_csv("data/paragraphs_media_dataset.csv")
combined_df.head(3)
| Unnamed: 0 | title | date | author | content | domain | url | tokens_title | tokens_content | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 23302 | Renewable Energy Partnerships: Green Hydrogen Integration | 2024-05-01 | NaN | ['As we move toward a sustainable future, green hydrogen is a significant player in the renewabl... | azocleantech | https://www.azocleantech.com/article.aspx?ArticleID=1823 | [renewable, energy, partnership, green, hydrogen, integration] | ['as, move, toward, sustainable, future, green, hydrogen, significant, player, renewable, energy... |
| 1 | 23338 | Asahi Kasei Starts Operation of Multi-Module Hydrogen Pilot Plant in Kawasaki | 2024-05-15 | NaN | ['The Japanese technology company Asahi Kasei is further accelerating its hydrogen business acti... | azocleantech | https://www.azocleantech.com/news.aspx?newsID=34871 | [asahi, kasei, start, operation, multi-module, hydrogen, pilot, plant, kawasaki] | ['the, japanese, technology, company, asahi, kasei, accelerating, hydrogen, business, activity, ... |
| 2 | 23355 | Global Warming Threatens Hudson Bay Polar Bears | 2024-06-14 | NaN | ['Recent research by Julienne Stroeve and colleagues, published in Communications Earth & Enviro... | azocleantech | https://www.azocleantech.com/news.aspx?newsID=34952 | [global, warming, threatens, hudson, bay, polar, bear] | ['recent, research, julienne, stroeve, colleague, published, communication, earth, environment, ... |
3. QA Pair Generation¶
We use the Llama 3.2 Instruct model and DeepSeek R1 to generate QA pairs from selected paragraphs. Each generated question and answer is tied to specific cleantech-related content, providing context for later analysis.
For the media dataset, the QA pairs include:
- question: The generated question.
- relevant_text: The text (paragraph) used as context.
- answer: The generated answer.
- article_url: The URL of the source article.
For the patent dataset, the QA pairs contain additional metadata:
- question: The generated question.
- relevant_text: The abstract used as context.
- answer: The generated answer.
- title: The patent title.
- publication_date: The patent publication date.
- inventor: The inventor(s) associated with the patent.
Note: Generating hundreds of QA pairs with Llama 3.2 Instruct and DeepSeek R1 can be very time-consuming, so we’ve provided the completed CSVs. Rather than regenerating them, you can simply load the media QA file and inspect its structure.
QA Pairs with Llama 3.2¶
Generate the QA Pairs for the Media Dataset¶
from openai import OpenAI
import pandas as pd
import re
def generate_qa_pairs_media(df, target_qa_pairs=43, dataset_type="media", model="meta-llama/llama-3.2-3b-instruct"):
    """
    Generates QA pairs from a given DataFrame using Llama 3.2 Instruct.

    Parameters:
        df (pd.DataFrame): DataFrame containing the reference texts.
            For the media dataset, it is expected to have a 'content' column and 'url'.
            For the patent dataset, it is expected to have an 'abstract' column, 'title',
            'publication_date', and 'inventor'.
        target_qa_pairs (int): The target number of QA pairs to generate.
        dataset_type (str): Specifies the dataset type. Use "media" for the media dataset,
            and "patent" for the patent dataset.

    Returns:
        pd.DataFrame: DataFrame containing the generated QA pairs and the associated metadata.
    """
    qa_pairs = []
    for _, row in df.iterrows():
        if dataset_type.lower() == "media":
            context_text = row['content']
        elif dataset_type.lower() == "patent":
            context_text = row['abstract']
        else:
            raise ValueError("dataset_type must be either 'media' or 'patent'.")
        paragraphs = context_text.split("\n\n")
        for paragraph in paragraphs:
            if not paragraph.strip():
                continue
            prompt = (
                "Using the following text, generate 15 unique and meaningful question–answer pairs.\n\n"
                "Instructions:\n"
                "1. Questions should be varied: include what, who, when, where, why, and how as appropriate.\n"
                "2. Focus on important details, facts, or concepts mentioned in the text.\n"
                "3. Provide clear answers that explain the key idea without being overly brief or too long.\n"
                "4. Format each pair as follows:\n"
                "Q: [Your question here]\n"
                "A: [Your answer here]\n\n"
                "Separate each Q&A pair with a blank line.\n\n"
                f"Text:\n{paragraph.strip()}"
            )
            try:
                completion = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                response = completion.choices[0].message.content.strip()
                # Extract Q&A pairs from the response
                qa_text = re.findall(r"Q:\s*(.*?)\s*A:\s*(.*?)(?=\nQ:|\Z)", response, re.DOTALL)
                for question, answer in qa_text:
                    if not any(existing["question"] == question.strip() for existing in qa_pairs):
                        if dataset_type.lower() == "media":
                            qa_pair = {
                                "question": question.strip(),
                                "relevant_text": paragraph.strip(),
                                "answer": answer.strip(),
                                "article_url": row.get('url', '')
                            }
                        else:
                            qa_pair = {
                                "question": question.strip(),
                                "relevant_text": paragraph.strip(),
                                "answer": answer.strip(),
                                "title": row.get('title', ''),
                                "publication_date": row.get('publication_date', ''),
                                "inventor": row.get('inventor', '')
                            }
                        qa_pairs.append(qa_pair)
                    if len(qa_pairs) >= target_qa_pairs:
                        break
            except Exception as e:
                print(f"Error: {e}")
                continue
            if len(qa_pairs) >= target_qa_pairs:
                break
        if len(qa_pairs) >= target_qa_pairs:
            break  # stop iterating rows once the target is reached
    return pd.DataFrame(qa_pairs)
media_qa_df_llama = generate_qa_pairs_media(combined_df, target_qa_pairs=800, dataset_type="media")
media_qa_df_llama.to_csv("data/media_qa_df_llama.csv", index=False)
Since generating QA pairs can be very time consuming, we’ve pre-created and saved the media QA DataFrame to datasets/media_qa_df_llama.csv. Here, we simply load it and take a quick peek at the first few entries:
media_qa_df_llama = pd.read_csv("datasets/media_qa_df_llama.csv")
media_qa_df_llama.head(5)[["question","answer"]]
| question | answer | |
|---|---|---|
| 0 | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... |
| 1 | Who are some of the key players involved in the development and adoption of green hydrogen? | Key players include companies like Siemens Energy, ITM Power, and H2Pro, as well as academia and... |
| 2 | What are the main challenges associated with integrating green hydrogen into the energy grid? | The main challenges include technical hurdles such as lower energy efficiency, safe storage and ... |
| 3 | What is the role of the European Union's Hydrogen Strategy in promoting green hydrogen adoption? | The strategy aims to significantly ramp up green hydrogen production by 2030, providing financia... |
| 4 | How does the Hydrogen Council facilitate partnerships and collaboration among companies, governm... | The Hydrogen Council brings together over 80 energy, transport, and industrial companies worldwi... |
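The `re.findall` pattern used in `generate_qa_pairs_media` to split a model reply into (question, answer) tuples can be exercised in isolation. The reply text below is a hypothetical example of the `Q:`/`A:` format the prompt requests:

```python
import re

# hypothetical model reply in the requested "Q: ... / A: ..." format
reply = """Q: What is green hydrogen?
A: Hydrogen produced via electrolysis powered by renewables.

Q: Who produces it?
A: Companies such as ITM Power."""

# same pattern as in generate_qa_pairs_media
pairs = re.findall(r"Q:\s*(.*?)\s*A:\s*(.*?)(?=\nQ:|\Z)", reply, re.DOTALL)
pairs = [(q.strip(), a.strip()) for q, a in pairs]
print(pairs)
```

The lookahead `(?=\nQ:|\Z)` lets each answer run until the next `Q:` line or the end of the reply, which is why pairs must be separated by a newline before the next `Q:`.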
Generate the QA Pairs for the Patent Dataset¶
For the patent dataset, we begin by filtering patents based on keywords relevant to cleantech (using two keyword sets, one for azocleantech and one for solarindustrymag). Patents do not come pre-categorized into domains like media articles, so we must generate QA pairs first and then apply a separate categorization process.
To extract patents relevant to cleantech topics, we use a deterministic, rule-based approach.
- We start by defining two lists of keywords:
- Next, we apply these lists to the patent dataset by examining the abstract field of each patent. Using a lambda function with the apply method, we deterministically check if any keyword exists in an abstract. This yields two Boolean masks indicating whether a patent belongs to the azocleantech or solarindustrymag domain, respectively. We then filter the dataset based on these masks.
- We decided to use a deterministic, rule-based approach because we can directly see which keywords triggered the classification, allowing us to easily audit and refine our keyword lists if needed.
- However, we are aware that this method is rigid, so if an abstract uses synonyms or slightly different phrasing that isn’t in our keyword list, it might not be classified correctly. The approach does not consider context or the possibility of words having multiple meanings. An abstract might mention a keyword in a context that does not pertain to cleantech.
By using this deterministic approach, we ensure that our patent dataset is filtered in a consistent and transparent manner. While this method is efficient and straightforward, it does have limitations regarding flexibility and nuance.
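The masking logic described above can be sketched as follows. The keyword lists here are hypothetical stand-ins (the notebook's actual lists are defined elsewhere), but the mechanism — a Boolean mask per keyword set, combined to filter the DataFrame — is the same:

```python
import pandas as pd

# hypothetical keyword lists standing in for the notebook's two sets
azocleantech_keywords = ["solar", "wind", "hydrogen", "battery"]
solarindustrymag_keywords = ["photovoltaic", "inverter", "module"]

df = pd.DataFrame({"abstract": [
    "A hydrogen fuel cell stack design.",
    "An improved photovoltaic inverter topology.",
    "A method for brewing coffee.",
]})

# deterministic check: does any keyword occur in the lowercased abstract?
contains_any = lambda text, kws: any(kw in text.lower() for kw in kws)
azo_mask = df["abstract"].apply(lambda t: contains_any(t, azocleantech_keywords))
solar_mask = df["abstract"].apply(lambda t: contains_any(t, solarindustrymag_keywords))

# keep patents matching either keyword set
filtered = df[azo_mask | solar_mask]
print(filtered)
```

The coffee abstract is dropped because it matches neither list — and, as noted above, an abstract using only synonyms of our keywords would be dropped the same way.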
# patent_df_related = pd.read_csv('datasets/patent_df_selected_abstracts.csv')
patent_df.head(3)
| publication_number | application_number | country_code | title_localized | abstract_localized | publication_date | inventor | cpc | tokens_title | tokens_abstract | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US-2022239235-A1 | US-202217717397-A | US | [{'text': 'Adaptable DC-AC Inverter Drive System and Operation', 'language': 'en', 'truncated': ... | [{'text': 'Disclosed is an adaptable DC-AC inverter system and its operation. The system include... | 20220728 | [] | [{'code': 'H02M7/5395', 'inventive': True, 'first': False, 'tree': []}, {'code': 'H02J3/32', 'in... | [adaptable, dc-ac, inverter, drive, system, operation] | [disclosed, adaptable, dc-ac, inverter, system, operation, system, includes, multiple, dc, input... |
| 1 | US-2022239251-A1 | US-202217580956-A | US | [{'text': 'System for providing the energy from a single contiguous solar energy structure to at... | [{'text': 'In accordance with an example embodiment, a solar energy system comprises: a solar en... | 20220728 | [] | [{'code': 'H02S40/38', 'inventive': True, 'first': False, 'tree': []}, {'code': 'G01R22/061', 'i... | [system, providing, energy, single, contiguous, solar, energy, structure, least, two, different,... | [accordance, example, embodiment, solar, energy, system, comprises, solar, energy, structure, co... |
| 2 | EP-4033090-A1 | EP-21152924-A | EP | [{'text': 'Verfahren zum steuern einer windenergieanlage', 'language': 'de', 'truncated': False}... | [{'text': 'Verfahren zum Steuern einer Windenergieanlage (100), wobei die Windenergieanlage (100... | 20220727 | [Schaper, Ulf, von Aswege, Enno, Gerke Funcke, Paul] | [{'code': 'F03D7/0276', 'inventive': True, 'first': True, 'tree': []}, {'code': 'F03D7/0296', 'i... | [verfahren, zum, steuern, einer, windenergieanlage] | [verfahren, zum, steuern, einer, windenergieanlage, 100, wobei, die, windenergieanlage, 100, ein... |
Extracting & Combining Patent Text Fields¶
In this step, we unpack the localized title and abstract lists to create usable text and metadata columns:
Language Extraction
- `title_lang`: the language code of the first entry in `title_localized`.
- `abstract_lang`: likewise for `abstract_localized`.
Text Extraction
- `title_text`: the actual title string from `title_localized`.
- `abstract_text`: the abstract text from `abstract_localized`.
Concatenate & Word Count
- `total`: joins `title_text` and `abstract_text` into a single field for each patent.
- `w_count`: counts the words in `total`, giving us a quick measure of document length for downstream filtering or sampling.
patent_df['title_lang'] = patent_df['title_localized'].apply(lambda x: x[0]['language'])
patent_df['abstract_lang'] = patent_df['abstract_localized'].apply(lambda x: x[0]['language'])
patent_df['title_text'] = patent_df['title_localized'].apply(lambda x: x[0]['text'])
patent_df['abstract_text'] = patent_df['abstract_localized'].apply(lambda x: x[0]['text'])
# note: title and abstract are concatenated without a separator
patent_df['total'] = patent_df['title_text'] + patent_df['abstract_text']
patent_df['w_count'] = patent_df['total'].apply(lambda x: len(x.split()))
patent_df.head(2)
| publication_number | application_number | country_code | title_localized | abstract_localized | publication_date | inventor | cpc | tokens_title | tokens_abstract | title_lang | abstract_lang | title_text | abstract_text | total | w_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | US-2022239235-A1 | US-202217717397-A | US | [{'text': 'Adaptable DC-AC Inverter Drive System and Operation', 'language': 'en', 'truncated': ... | [{'text': 'Disclosed is an adaptable DC-AC inverter system and its operation. The system include... | 20220728 | [] | [{'code': 'H02M7/5395', 'inventive': True, 'first': False, 'tree': []}, {'code': 'H02J3/32', 'in... | [adaptable, dc-ac, inverter, drive, system, operation] | [disclosed, adaptable, dc-ac, inverter, system, operation, system, includes, multiple, dc, input... | en | en | Adaptable DC-AC Inverter Drive System and Operation | Disclosed is an adaptable DC-AC inverter system and its operation. The system includes multiple ... | Adaptable DC-AC Inverter Drive System and OperationDisclosed is an adaptable DC-AC inverter syst... | 70 |
| 1 | US-2022239251-A1 | US-202217580956-A | US | [{'text': 'System for providing the energy from a single contiguous solar energy structure to at... | [{'text': 'In accordance with an example embodiment, a solar energy system comprises: a solar en... | 20220728 | [] | [{'code': 'H02S40/38', 'inventive': True, 'first': False, 'tree': []}, {'code': 'G01R22/061', 'i... | [system, providing, energy, single, contiguous, solar, energy, structure, least, two, different,... | [accordance, example, embodiment, solar, energy, system, comprises, solar, energy, structure, co... | en | en | System for providing the energy from a single contiguous solar energy structure to at least two ... | In accordance with an example embodiment, a solar energy system comprises: a solar energy struct... | System for providing the energy from a single contiguous solar energy structure to at least two ... | 109 |
To understand the range of patent text sizes we’ll be working with, we plot a histogram (with KDE) of the combined word counts of each patent’s title + abstract:
plt.figure(figsize=(10, 6))
sns.histplot(patent_df['w_count'], kde=True)
plt.title('Distribution of Word Counts in Patents')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.show()
There’s a large spike at very low word counts (near zero), indicating some records with missing or extremely short abstracts/titles.
The bulk of patents fall in the 150–250 word range, reflecting typical abstract + title lengths.
A long tail beyond 300 words suggests a few particularly verbose patents.
We want to filter out extremely short records (e.g., w_count < 20) to avoid generating trivial or uninformative QA pairs.
Language Distribution in Patent Titles vs. Abstracts¶
Here we check which language combinations appear in our patent records:
display(patent_df[['title_lang','abstract_lang']].value_counts())
title_lang abstract_lang
en en 8297
zh zh 7741
en zh 6624
zh en 6463
fr en 226
de en 128
en fr 126
fr fr 64
de de 38
en ko 34
ko ko 29
fr de 28
de fr 28
ru ru 26
fr ko 24
en de 22
ru en 20
zh fr 18
en ru 14
fr zh 12
ar ar 6
es fr 6
en es 6
es en 6
en hr 4
ar 4
he en 2
en pl 2
sr sr 2
Name: count, dtype: int64
lang_pairs = (
patent_df[['title_lang', 'abstract_lang']]
.value_counts()
.reset_index(name='count')
.assign(pair=lambda df: df['title_lang'] + '→' + df['abstract_lang'])
)
plt.figure(figsize=(8, 6))
sns.barplot(
data=lang_pairs.nlargest(10, 'count'),
x='count',
y='pair'
)
plt.title("Top 10 Patent Title→Abstract Language Pairs")
plt.xlabel("Number of Patents")
plt.ylabel("Language Pair")
plt.tight_layout()
plt.show()
Filtering to English & Minimum Length¶
To focus on fully English patents with substantive content, we filter out any records where either the title or abstract isn’t in English, and where the combined text is too short to yield meaningful Q&A. Specifically:
- Keep only patents with `title_lang == 'en'` and `abstract_lang == 'en'`.
- Exclude any patents whose `w_count` (title + abstract word count) is ≤ 20.
This reduces our set to 8,296 out of the original ~30,000 patents:
clean_patent_df = patent_df[(patent_df['title_lang'] == 'en') & (patent_df['abstract_lang'] == 'en') & (patent_df['w_count'] > 20)].copy()
len(clean_patent_df)
8296
plt.figure(figsize=(10, 6))
sns.kdeplot(patent_df['w_count'], label='Original', bw_adjust=1)
sns.kdeplot(clean_patent_df['w_count'], label='Filtered (w_count>20)', bw_adjust=1)
plt.title("Patent Word-Count Distribution: Original vs. Filtered")
plt.xlabel("Word Count")
plt.ylabel("Density")
plt.legend()
plt.show()
The original distribution (blue) has a sharp peak near zero due to very short or empty abstracts.
After filtering (orange), that peak disappears, and the bulk of patents centers around 150–250 words.
This confirms we’ve removed trivial entries while retaining the core body of substantive patents ideal for generating meaningful QA pairs.
Topic Modeling on Cleaned Patent Texts¶
Now that we have tokenized and preprocessed our patent texts, we’ll use Gensim’s LDA to surface the main themes across patents:
Prepare the Data
- Apply our
preprocess_textfunction to the combinedtitle + abstract(total) field, storing the result inclean_tokens. - Collect these token lists as
processed_docs.
- Apply our
Build Dictionary & Corpus
dictionary = corpora.Dictionary(processed_docs)creates a mapping of each unique token to an integer ID.corpus = [dictionary.doc2bow(doc) for doc in processed_docs]converts each document into a bag-of-words (token ID, count) format.
Train the LDA Model
where we choose 10 topics:
lda_model = gensim.models.LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=10,  # we choose 10 topics
    passes=10,      # 10 full corpus passes
    workers=4       # parallelize across 4 CPU cores
)
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora
clean_patent_df['clean_tokens'] = clean_patent_df['total'].apply(preprocess_text)
processed_docs = clean_patent_df['clean_tokens']
# create a dictionary and corpus
dictionary = corpora.Dictionary(processed_docs)
corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
# apply LDA topic modeling
lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=10, passes=10, workers=4)
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: {idx} \nWords: {topic}\n")
# find the dominant topic (highest-probability topic) for each document
dominant_topics = []
for index, row in clean_patent_df.iterrows():
    bow = dictionary.doc2bow(row['clean_tokens'])
    dominant_topic = max(lda_model[bow], key=lambda item: item[1])
    dominant_topics.append(dominant_topic)
Topic: 0
Words: 0.026*"solar" + 0.025*"energy" + 0.015*"panel" + 0.011*"power" + 0.010*"water" + 0.009*"support" + 0.009*"system" + 0.009*"connected" + 0.008*"model" + 0.008*"heat"

Topic: 1
Words: 0.017*"water" + 0.011*"energy" + 0.009*"comprises" + 0.009*"structure" + 0.009*"arranged" + 0.008*"end" + 0.008*"first" + 0.008*"photovoltaic" + 0.007*"solar" + 0.007*"mechanism"

Topic: 2
Words: 0.032*"solar" + 0.031*"energy" + 0.015*"power" + 0.014*"storage" + 0.012*"water" + 0.011*"system" + 0.011*"device" + 0.010*"arranged" + 0.010*"panel" + 0.009*"model"

Topic: 3
Words: 0.027*"energy" + 0.024*"heat" + 0.023*"solar" + 0.015*"hydrogen" + 0.015*"system" + 0.013*"water" + 0.012*"pipe" + 0.008*"body" + 0.008*"device" + 0.008*"connected"

Topic: 4
Words: 0.020*"power" + 0.018*"device" + 0.016*"body" + 0.015*"solar" + 0.014*"module" + 0.014*"energy" + 0.013*"system" + 0.009*"connected" + 0.009*"control" + 0.008*"water"

Topic: 5
Words: 0.013*"water" + 0.013*"device" + 0.012*"solar" + 0.011*"energy" + 0.010*"unit" + 0.008*"connected" + 0.008*"plate" + 0.008*"mechanism" + 0.008*"groove" + 0.007*"side"

Topic: 6
Words: 0.022*"energy" + 0.018*"solar" + 0.014*"photovoltaic" + 0.013*"heat" + 0.012*"device" + 0.011*"air" + 0.009*"power" + 0.009*"system" + 0.008*"control" + 0.007*"model"

Topic: 7
Words: 0.014*"energy" + 0.014*"solar" + 0.014*"module" + 0.013*"layer" + 0.011*"first" + 0.009*"second" + 0.009*"system" + 0.009*"method" + 0.009*"plate" + 0.008*"one"

Topic: 8
Words: 0.021*"energy" + 0.016*"power" + 0.015*"solar" + 0.014*"arranged" + 0.013*"device" + 0.013*"wind" + 0.012*"heat" + 0.012*"water" + 0.011*"side" + 0.010*"connected"

Topic: 9
Words: 0.018*"plate" + 0.016*"solar" + 0.014*"energy" + 0.013*"water" + 0.011*"first" + 0.011*"body" + 0.010*"side" + 0.010*"connected" + 0.009*"model" + 0.009*"power"
def get_top_words_and_docs(lda_model, corpus, dictionary, num_topics=10, top_words=10, top_docs=5):
    topic_results = []
    for topic_idx in range(num_topics):
        # get top words for the topic
        top_words_for_topic = [word for word, prob in lda_model.show_topic(topic_idx, top_words)]
        # get the documents with the highest probability for this topic
        topic_document_probs = []  # (doc index, probability) pairs for this topic
        for doc_idx, doc_bow in enumerate(corpus):
            for topic_prob_idx, topic_prob in lda_model[doc_bow]:
                if topic_prob_idx == topic_idx:
                    topic_document_probs.append((doc_idx, topic_prob))
                    break  # move on to the next document
        topic_document_probs.sort(key=lambda x: x[1], reverse=True)  # sort descending by probability
        top_docs_for_topic = [doc_idx for doc_idx, _ in topic_document_probs[:top_docs]]
        topic_results.append({
            'topic_id': topic_idx,
            'top_words': top_words_for_topic,
            'top_documents': top_docs_for_topic
        })
    return topic_results
topic_results = get_top_words_and_docs(lda_model, corpus, dictionary)
document_indices = []
for topic in topic_results:
    print(f"Topic {topic['topic_id']}:")
    print("Top Words:", topic['top_words'])
    print("Top Documents:", topic['top_documents'])
    document_indices.extend(topic['top_documents'])
    print("-" * 20)
Topic 0:
Top Words: ['solar', 'energy', 'panel', 'power', 'water', 'support', 'system', 'connected', 'model', 'heat']
Top Documents: [5300, 5303, 5304, 5509, 5511]
--------------------
Topic 1:
Top Words: ['water', 'energy', 'comprises', 'structure', 'arranged', 'end', 'first', 'photovoltaic', 'solar', 'mechanism']
Top Documents: [482, 530, 552, 4788, 1130]
--------------------
Topic 2:
Top Words: ['solar', 'energy', 'power', 'storage', 'water', 'system', 'device', 'arranged', 'panel', 'model']
Top Documents: [5572, 5569, 5588, 662, 3535]
--------------------
Topic 3:
Top Words: ['energy', 'heat', 'solar', 'hydrogen', 'system', 'water', 'pipe', 'body', 'device', 'connected']
Top Documents: [24, 25, 16, 2043, 2083]
--------------------
Topic 4:
Top Words: ['power', 'device', 'body', 'solar', 'module', 'energy', 'system', 'connected', 'control', 'water']
Top Documents: [2528, 2537, 1474, 1703, 3423]
--------------------
Topic 5:
Top Words: ['water', 'device', 'solar', 'energy', 'unit', 'connected', 'plate', 'mechanism', 'groove', 'side']
Top Documents: [7535, 7537, 7542, 1844, 1848]
--------------------
Topic 6:
Top Words: ['energy', 'solar', 'photovoltaic', 'heat', 'device', 'air', 'power', 'system', 'control', 'model']
Top Documents: [6703, 6707, 6716, 6731, 7957]
--------------------
Topic 7:
Top Words: ['energy', 'solar', 'module', 'layer', 'first', 'second', 'system', 'method', 'plate', 'one']
Top Documents: [5445, 5446, 6361, 6362, 6363]
--------------------
Topic 8:
Top Words: ['energy', 'power', 'solar', 'arranged', 'device', 'wind', 'heat', 'water', 'side', 'connected']
Top Documents: [5618, 5622, 61, 73, 6654]
--------------------
Topic 9:
Top Words: ['plate', 'solar', 'energy', 'water', 'first', 'body', 'side', 'connected', 'model', 'power']
Top Documents: [5902, 5903, 5904, 7391, 7392]
--------------------
patent_df_selected_abstracts = clean_patent_df.iloc[document_indices]
patent_df_selected_abstracts.head(2)
| publication_number | application_number | country_code | title_localized | abstract_localized | publication_date | inventor | cpc | tokens_title | tokens_abstract | title_lang | abstract_lang | title_text | abstract_text | total | w_count | clean_tokens | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17713 | CN-113486582-B | CN-202110760093-A | CN | [{'text': 'Modeling and speed regulation parameter optimization method and system for hydropower... | [{'text': 'The invention discloses a hydropower station modeling and speed regulation parameter ... | 20220218 | [] | [{'code': 'G06F2119/14', 'inventive': False, 'first': False, 'tree': []}, {'code': 'G06F2113/08'... | [modeling, speed, regulation, parameter, optimization, method, system, hydropower, station, doub... | [invention, discloses, hydropower, station, modeling, speed, regulation, parameter, optimization... | en | en | Modeling and speed regulation parameter optimization method and system for hydropower station wi... | The invention discloses a hydropower station modeling and speed regulation parameter optimizatio... | Modeling and speed regulation parameter optimization method and system for hydropower station wi... | 253 | [modeling, speed, regulation, parameter, optimization, method, system, hydropower, station, doub... |
| 17716 | CN-113486582-B | CN-202110760093-A | CN | [{'text': 'Modeling and speed regulation parameter optimization method and system for hydropower... | [{'text': 'The invention discloses a hydropower station modeling and speed regulation parameter ... | 20220218 | [] | [{'code': 'G06F2119/14', 'inventive': False, 'first': False, 'tree': []}, {'code': 'G06F2113/08'... | [modeling, speed, regulation, parameter, optimization, method, system, hydropower, station, doub... | [invention, discloses, hydropower, station, modeling, speed, regulation, parameter, optimization... | en | en | Modeling and speed regulation parameter optimization method and system for hydropower station wi... | The invention discloses a hydropower station modeling and speed regulation parameter optimizatio... | Modeling and speed regulation parameter optimization method and system for hydropower station wi... | 253 | [modeling, speed, regulation, parameter, optimization, method, system, hydropower, station, doub... |
patent_df_selected_abstracts.to_csv('data/patent_df_selected_abstracts.csv',index=False)
patent_df_selected_abstracts = pd.read_csv("datasets/patent_df_selected_abstracts.csv")
Generating QA Pairs from Patent Abstracts¶
This function uses the Llama 3.2 Instruct model to extract exactly three question–answer pairs per patent abstract, returning them in a structured JSON format:
`extract_json_from_text(text)`
- Uses a regex to pull out the JSON array (between `[` and `]`) from the model’s response.

`generate_qa_pairs_patent(df, target_qa_pairs, model)`
- Makes repeated passes over the patent DataFrame, requesting QA pairs for each row.
- For each patent’s combined text (`total`), it sends a prompt asking for three Q&A pairs.
- Parses the JSON reply, logs any extraction or parsing errors, and collects each `{question, answer, relevant_text, title, publication_date, inventor}` record into a list.
- Returns a pandas DataFrame of all generated QA pairs.
import json
import re
def extract_json_from_text(text):
    # try to find the JSON block (between the first [ and a closing ])
    match = re.search(r"\[\s*{.*?}\s*]", text, re.DOTALL)
    if match:
        return match.group(0)
    return None
def generate_qa_pairs_patent(df, target_qa_pairs=43, model="meta-llama/llama-3.2-3b-instruct"):
    qa_pairs = []
    # note: the outer loop makes `target_qa_pairs` full passes over df;
    # each pass requests 3 QA pairs per patent
    for _ in range(target_qa_pairs):
        for _, row in tqdm(df.iterrows()):
            abstract = row['total']
            prompt = f"""
You are a helpful assistant. Extract exactly 3 useful and meaningful question and answer pairs from the following text:

{abstract}

Return your answer in this JSON format:
[
    {{
        "question": "Your first question?",
        "answer": "The answer to the first question."
    }},
    {{
        "question": "Your second question?",
        "answer": "The answer to the second question."
    }},
    {{
        "question": "Your third question?",
        "answer": "The answer to the third question."
    }}
]
"""
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}]
            )
            response_text = response.choices[0].message.content
            json_block = extract_json_from_text(response_text)
            if not json_block:
                print("❌ Could not find JSON in response:")
                print(response_text)
                continue
            try:
                qa_list = json.loads(json_block)
                for qa in qa_list:
                    qa_pairs.append({
                        "question": qa["question"],
                        "answer": qa["answer"],
                        "relevant_text": abstract,
                        "title": row["title_text"],
                        "publication_date": row["publication_date"],
                        "inventor": row["inventor"]
                    })
            except json.JSONDecodeError:
                print("❌ JSON parsing failed. Here's what the model returned:")
                print(json_block)
                continue
    return pd.DataFrame(qa_pairs)
patent_df_related = pd.read_csv('data/patent_df_selected_abstracts.csv')
patents_qa_df_llama = pd.read_csv("datasets/patents_qa_df_510_llama_new.csv")
patents_qa_df_llama = generate_qa_pairs_patent(patent_df_related, target_qa_pairs=4)
❌ Could not find JSON in response:
1. [
2. {
3. "question": "What is the purpose of the inlet screen in a hydropower plant?",
4. "answer": "The inlet screen is adapted to be arranged at the water inlet of a hydropower plant to filter debris and other materials from entering the system, while also being equipped with heating means to prevent ice formation."
5. },
6. {
7. "question": "How does the heating mechanism work in the inlet screen's elongated bars?",
8. "answer": "The heating mechanism involves electric heating means, such as at least one electric heating member, which is introduced into a space defined along a portion of the elongation of the bar to provide heat and prevent ice buildup."
9. },
10. {
11. "question": "What are the structural features of the elongated bars in the inlet screen?",
12. "answer": "Each elongated bar has a proximal and distal portion, an upstream and downstream region at an angle to the proximal and distal portions, and an intermediate portion between the upstream and downstream regions where a space for the heating means is located."
13. }
14. ]
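The extraction failure shown above happens because the model numbered every line of its reply, so the `[` is no longer followed only by whitespace before `{`, which the regex requires. A self-contained reproduction of the helper (copied from the notebook) shows the behavior:

```python
import re

def extract_json_from_text(text):
    # same regex as in the notebook: a JSON array of objects between [ and ]
    match = re.search(r"\[\s*{.*?}\s*]", text, re.DOTALL)
    return match.group(0) if match else None

# a well-formed reply is matched; a line-numbered reply (as above) is not
clean_reply = '[\n  {"question": "Q1?", "answer": "A1."}\n]'
numbered_reply = '1. [\n2. {\n3. "question": "Q1?",\n4. "answer": "A1."\n5. }\n6. ]'

print(extract_json_from_text(clean_reply))
print(extract_json_from_text(numbered_reply))
```

A more tolerant fallback would strip leading `N.` prefixes from each line before retrying the regex, at the cost of occasionally mangling legitimate numbered content inside answers.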
patents_qa_df_llama.head(5)[['question','answer']]
| question | answer | |
|---|---|---|
| 0 | What is the purpose of the cold accumulator evaporator? | To supply cold to the cold storage, reducing the cost of the cold storage and saving energy. |
| 1 | What is the role of the refrigeration house condenser? | It is connected with the cold accumulation compressor unit and supplies steam to it. |
| 2 | What is the overall function of the distributed photovoltaic energy storage refrigeration house ... | To convert solar energy into electric energy to store ice and provide cold storage while saving ... |
| 3 | What is the purpose of the distributed photovoltaic energy storage refrigeration house system? | The system supplies cold to the cold storage, reducing the cost of cold storage and saving energy. |
| 4 | What is the main function of the cold accumulation equipment? | The cold accumulation equipment stores cold energy from the solar panels, which is then used to ... |
patents_qa_df_llama.to_csv('patents_qa_df_510_llama.csv', index=False)
QA Pairs with Deepseek R1¶
Generate the QA Pairs for the Media Dataset¶
models = client.models.list()
for model in models:
    if 'deepseek' in model.id:
        print(model.id)
MODEL = 'deepseek/deepseek-chat-v3-0324'
media_qa_df_deepseek = generate_qa_pairs_media(combined_df, target_qa_pairs=50, dataset_type="media",model = MODEL)
media_qa_df_deepseek.to_csv("data/media_qa_pairs_deepseek.csv", index= False)
media_qa_df_deepseek.head()[['question','answer']]
media_qa_df_deepseek = pd.read_csv("datasets/media_qa_pairs_deepseek.csv")
media_qa_df_deepseek.head(4)[['question','answer']]
| question | answer | |
|---|---|---|
| 0 | ** What is green hydrogen, and how is it produced? \n** | ** Green hydrogen is a sustainable energy carrier produced through water electrolysis powered by... |
| 1 | ** What is the purpose of Asahi Kasei’s new hydrogen pilot plant in Kawasaki? \n** | ** The pilot plant is designed to test the Aqualyzer™ alkaline water electrolyzer under realisti... |
| 2 | What is the main concern raised by Julienne Stroeve's research regarding Hudson Bay polar bears? | The research highlights that global warming could extend the ice-free period in Hudson Bay, shor... |
| 3 | How does the projected temperature increase (beyond 2.1°C) impact the ice-free period in Hudson ... | If global warming surpasses 2.1°C, the ice-free period in parts of Hudson Bay could exceed 183 d... |
Generate the QA Pairs for the Patent Dataset¶
patents_qa_df_deepseek = generate_qa_pairs_patent(patent_df_related, target_qa_pairs=1,model=MODEL)
patents_qa_df_deepseek.to_csv('data/patents_qa_df_deepseek.csv', index=False)
patents_qa_df_deepseek = pd.read_csv('datasets/patents_qa_df_deepseek.csv')
patents_qa_df_deepseek.head(4)[['question','answer']]
| question | answer | |
|---|---|---|
| 0 | What are the two types of electron acceptor materials used in the disclosed ternary blends for o... | The two types are N-annulated perylene diimide-based electron acceptor material and BTP-based el... |
| 1 | Which electron donor material is used in the ternary bulk heterojunction (BHJ) blends described ... | The electron donor material used is poly[(2,5-bis(2-hexyldecyloxy)phenylene)-alt-(5,6-difluoro-4... |
| 2 | What is the purpose of incorporating N-annulated perylene diimide dimer (tPDI2N-EH, PDI) or BTP-... | They are incorporated to create ternary bulk heterojunction (BHJ) blends for low-intensity light... |
| 3 | What are the two types of electron acceptor materials used in the ternary blends for organic pho... | The two types of electron acceptor materials are N-annulated perylene diimide-based electron acc... |
Key Observations¶
Uniqueness is high for Llama (almost 1:1 question-to-pair), whereas Deepseek shows slight redundancy (e.g., 147 pairs but only 138 unique questions).
Median question lengths are nearly identical across methods (~12–13 words), indicating a similar level of question complexity or focus.
Deepseek answers are substantially longer on average (≈43 words for media, 32 for patents) than Llama’s (≈19–21 words). This suggests Deepseek prioritizes more elaborate explanations, while Llama keeps answers concise.
Based on our comparison, we’ll proceed with the Llama 3.2 generated QA sets for both media and patents. Llama’s output offers the greatest breadth (highest count), near one to one question uniqueness, and concise answers ideal for large scale RAG evaluation.
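The uniqueness and answer-length statistics quoted above can be computed with two small helpers. The DataFrames below are toy stand-ins for the loaded QA sets (the answers are hypothetical), just to show the calculation:

```python
import pandas as pd

# toy stand-ins for the loaded QA DataFrames
llama_df = pd.DataFrame({
    "question": ["What is X?", "Who makes Y?"],
    "answer": ["Concise answer.", "Another short reply here."],
})
deepseek_df = pd.DataFrame({
    "question": ["What is X?", "What is X?"],  # a duplicated question
    "answer": ["A noticeably longer and more elaborate answer that spells out "
               "background, mechanism, and implications in full sentences."] * 2,
})

def uniqueness_ratio(df):
    # unique questions divided by total pairs (1.0 means no redundancy)
    return df["question"].nunique() / len(df)

def mean_answer_words(df):
    # average answer length in whitespace-separated words
    return df["answer"].str.split().str.len().mean()

print(uniqueness_ratio(llama_df), mean_answer_words(llama_df))
print(uniqueness_ratio(deepseek_df), mean_answer_words(deepseek_df))
```

Applied to the real `media_qa_df_llama` / `media_qa_df_deepseek` frames, these two functions reproduce the comparison behind our decision to proceed with the Llama-generated sets.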
counts = {
'Media (Llama)': len(media_qa_df_llama),
'Media (Deepseek)': len(media_qa_df_deepseek),
'Patents (Llama)': len(patents_qa_df_llama),
'Patents (Deepseek)': len(patents_qa_df_deepseek),
}
plt.figure(figsize=(6,4))
plt.bar(counts.keys(), counts.values())
plt.title("Total QA Pairs by Domain & Method")
plt.ylabel("Number of QA Pairs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
4. QA Pair Classification¶
After generating our QA pairs, our next step is to classify each question into a category. This categorization is essential to ensure that our QA pairs cover diverse aspects of cleantech, such as factual details, comparative market analysis, analytical explanations, government or corporate initiatives, and sustainability or technological innovations.
For each generated QA pair, we want to automatically determine the most suitable category based on the question's content. We experimented with two methods:
A Rule-Based Approach:
This method involves defining a dictionary of categories along with associated keywords. For each question, we count the occurrences of each category's keywords; the category with the highest score is assigned to the question. It is easily interpretable, since we can trace each decision to specific keyword matches, but it has limited flexibility if questions use synonyms or phrasing not in the keyword list.
BERT Sentence Approach: In this approach, we use a SentenceTransformer model to encode each generated question into an embedding vector. We then compare these embeddings to precomputed embeddings of category prototype texts (i.e., descriptions of what each category should capture) using cosine similarity, and assign the category with the highest similarity score. The quality of classification depends heavily on the choice of prototype texts and the underlying model. We evaluated both methods and found that the BERT sentence approach outperformed the rule-based method. For patent questions, we also experimented with replacing the all-MiniLM-L6-v2 model with the AI-Growth-Lab/PatentSBERTa model, which was even better at capturing nuanced technical details.
patent_qa_df_llama = pd.read_csv("datasets/patents_qa_df_510_llama_new.csv")
media_qa_df_llama = pd.read_csv("datasets/media_qa_df_llama.csv")
patents_qa_df = [patent_qa_df_llama]
media_qa_df = [media_qa_df_llama]
The rule-based approach¶
def classify_question_rule_based(question, categories):
question_lower = question.lower()
scores = {cat: 0 for cat in categories}
for cat, keywords in categories.items():
for kw in keywords:
if kw in question_lower:
scores[cat] += 1
return max(scores, key=scores.get)
categories = {
"Factual Questions": ["what", "when", "where", "who", "which", "fact"],
"Comparative & Market Analysis Questions": ["compare", "contrast", "market", "trend", "analysis"],
"Analytical & Explanatory Questions": ["why", "how", "explain", "impact"],
"Government & Corporate Initiatives": ["government", "policy", "corporate", "initiative", "regulation"],
"Sustainability & Technological Innovation Questions": ["sustainable", "innovation", "technology", "renewable", "eco-friendly"]
}
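One caveat of the `kw in question_lower` test above is that it matches substrings, so a short keyword can fire inside a longer word (e.g., "fact" inside "manufacturing"). A word-boundary variant can be sketched as follows (an illustrative alternative, not the classifier used in this notebook):

```python
import re

def classify_question_word_boundary(question, categories):
    """Rule-based classifier that matches whole words only, so e.g.
    'fact' no longer fires inside 'manufacturing'. Ties and no-match
    questions still fall back to the first category in the dict."""
    question_lower = question.lower()
    scores = {
        cat: sum(bool(re.search(rf"\b{re.escape(kw)}\b", question_lower))
                 for kw in keywords)
        for cat, keywords in categories.items()
    }
    return max(scores, key=scores.get)
```

Note that `max(scores, key=scores.get)` returns the first maximum, which is one reason the rule-based method skews toward whichever category is listed first.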
def stack_and_categorize(dfs, categories):
df_combined = pd.concat(dfs, ignore_index=True)
df_combined['category'] = df_combined['question'].apply(lambda q: classify_question_rule_based(q, categories))
return df_combined
df_patent_categorized_rule_based = stack_and_categorize(patents_qa_df, categories)
df_media_categorized_rule_based = stack_and_categorize(media_qa_df, categories)
df_patent_categorized_rule_based.head(3)
| question | answer | relevant_text | title | publication_date | inventor | category | |
|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the cold accumulator evaporator? | To supply cold to the cold storage, reducing the cost of the cold storage and saving energy. | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Factual Questions |
| 1 | What is the role of the refrigeration house condenser? | It is connected with the cold accumulation compressor unit and supplies steam to it. | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Factual Questions |
| 2 | What is the overall function of the distributed photovoltaic energy storage refrigeration house ... | To convert solar energy into electric energy to store ice and provide cold storage while saving ... | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Factual Questions |
df_media_categorized_rule_based.head(3)
| question | relevant_text | answer | article_url | category | |
|---|---|---|---|---|---|
| 0 | What is green hydrogen, and how is it produced? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Factual Questions |
| 1 | Who are some of the key players involved in the development and adoption of green hydrogen? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | Key players include companies like Siemens Energy, ITM Power, and H2Pro, as well as academia and... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Factual Questions |
| 2 | What are the main challenges associated with integrating green hydrogen into the energy grid? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | The main challenges include technical hurdles such as lower energy efficiency, safe storage and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Factual Questions |
The Sentence transformer approach¶
The all-MiniLM-L6-v2 approach¶
from sentence_transformers import SentenceTransformer, util
import torch
prototypes = {
"Factual Questions": "Questions that ask for specific facts or details.",
"Comparative & Market Analysis Questions": "Questions that compare options or analyze market trends.",
"Analytical & Explanatory Questions": "Questions that require detailed analysis or explanation.",
"Government & Corporate Initiatives": "Questions related to government policies or corporate initiatives.",
"Sustainability & Technological Innovation Questions": "Questions focusing on sustainability and innovative technology."
}
model = SentenceTransformer('all-MiniLM-L6-v2')
prototype_texts = list(prototypes.values())
prototype_labels = list(prototypes.keys())
prototype_embeddings = model.encode(prototype_texts, convert_to_tensor=True)
def classify_question_embedding(question: str) -> str:
"""
Classify a question into one of the five categories using cosine similarity
between its embedding and the precomputed prototype embeddings.
"""
question_embedding = model.encode(question, convert_to_tensor=True)
cosine_scores = util.cos_sim(question_embedding, prototype_embeddings)
best_idx = torch.argmax(cosine_scores).item()
return prototype_labels[best_idx]
def stack_and_categorize(dfs: list) -> pd.DataFrame:
"""
Stack multiple DataFrames and add a new 'category' column by classifying each question.
"""
df_combined = pd.concat(dfs, ignore_index=True)
df_combined['category'] = df_combined['question'].apply(lambda q: classify_question_embedding(q))
return df_combined
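Calling `model.encode` once per question inside `.apply` is slow for hundreds of rows. The same cosine-similarity classification can be batched with a single encode call, sketched here in NumPy (`encode_fn` is a stand-in for `model.encode`; any callable returning an (n, d) array works):

```python
import numpy as np

def classify_questions_batch(questions, encode_fn, prototype_embeddings, prototype_labels):
    """Classify many questions with one batched encode call.

    Computes cosine similarity by L2-normalizing both the question and
    prototype embeddings, then taking the dot product, and returns the
    best-matching label per question."""
    emb = np.asarray(encode_fn(questions), dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    protos = np.asarray(prototype_embeddings, dtype=float)
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    best = (emb @ protos.T).argmax(axis=1)
    return [prototype_labels[i] for i in best]
```

With the notebook's objects this would be called roughly as `classify_questions_batch(df['question'].tolist(), model.encode, prototype_embeddings.cpu().numpy(), prototype_labels)`, though the exact tensor-to-array conversion depends on your device setup.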
df_patent_categorized_sentence_bert = stack_and_categorize(patents_qa_df)
df_media_categorized_sentence_bert = stack_and_categorize(media_qa_df)
df_patent_categorized_sentence_bert.head(3)
| question | answer | relevant_text | title | publication_date | inventor | category | |
|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the cold accumulator evaporator? | To supply cold to the cold storage, reducing the cost of the cold storage and saving energy. | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Analytical & Explanatory Questions |
| 1 | What is the role of the refrigeration house condenser? | It is connected with the cold accumulation compressor unit and supplies steam to it. | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Government & Corporate Initiatives |
| 2 | What is the overall function of the distributed photovoltaic energy storage refrigeration house ... | To convert solar energy into electric energy to store ice and provide cold storage while saving ... | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Sustainability & Technological Innovation Questions |
df_media_categorized_sentence_bert.head(3)
| question | relevant_text | answer | article_url | category | |
|---|---|---|---|---|---|
| 0 | What is green hydrogen, and how is it produced? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions |
| 1 | Who are some of the key players involved in the development and adoption of green hydrogen? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | Key players include companies like Siemens Energy, ITM Power, and H2Pro, as well as academia and... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions |
| 2 | What are the main challenges associated with integrating green hydrogen into the energy grid? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | The main challenges include technical hurdles such as lower energy efficiency, safe storage and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions |
The AI-Growth-Lab/PatentSBERTa approach¶
model = SentenceTransformer('AI-Growth-Lab/PatentSBERTa')
prototype_texts = list(prototypes.values())
prototype_labels = list(prototypes.keys())
prototype_embeddings = model.encode(prototype_texts, convert_to_tensor=True)
df_patent_categorized_patent_sentence_bert = stack_and_categorize(patents_qa_df)
df_media_categorized_patent_sentence_bert = stack_and_categorize(media_qa_df)
df_patent_categorized_patent_sentence_bert.head(3)
| question | answer | relevant_text | title | publication_date | inventor | category | |
|---|---|---|---|---|---|---|---|
| 0 | What is the purpose of the cold accumulator evaporator? | To supply cold to the cold storage, reducing the cost of the cold storage and saving energy. | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Analytical & Explanatory Questions |
| 1 | What is the role of the refrigeration house condenser? | It is connected with the cold accumulation compressor unit and supplies steam to it. | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Analytical & Explanatory Questions |
| 2 | What is the overall function of the distributed photovoltaic energy storage refrigeration house ... | To convert solar energy into electric energy to store ice and provide cold storage while saving ... | Distributed photovoltaic energy storage refrigeration house systemThe utility model discloses a ... | Distributed photovoltaic energy storage refrigeration house system | 20220405 | ['ZHANG YU', 'WANG GUANGLIN'] | Sustainability & Technological Innovation Questions |
df_media_categorized_patent_sentence_bert.head(3)
| question | relevant_text | answer | article_url | category | |
|---|---|---|---|---|---|
| 0 | What is green hydrogen, and how is it produced? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions |
| 1 | Who are some of the key players involved in the development and adoption of green hydrogen? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | Key players include companies like Siemens Energy, ITM Power, and H2Pro, as well as academia and... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions |
| 2 | What are the main challenges associated with integrating green hydrogen into the energy grid? | As we move toward a sustainable future, green hydrogen is a significant player in the renewable ... | The main challenges include technical hurdles such as lower energy efficiency, safe storage and ... | https://www.azocleantech.com/article.aspx?ArticleID=1823 | Sustainability & Technological Innovation Questions |
Visualizing and Comparing Classification Approaches¶
After classifying our QA pairs, we plotted the number of questions in each category to evaluate how well each approach captured the variety of question types.
Rule-Based Approach: The rule-based classifier is biased toward Factual Questions because it relies heavily on keyword matching. When a question contains none of the keywords associated with the other categories, it is labeled factual by default.
BERT Sentence Approach: The BERT-based semantic classification captures more nuance and distributes questions more evenly across categories. It identifies many questions about sustainability and technological innovation, highlighting how semantic embeddings can detect relevant context even when explicit keywords are missing.
Patent-Specific BERT Model (PatentSBERTa): Because PatentSBERTa is fine-tuned on patent text, it is better at placing technical questions into the correct categories. This is especially important for recognizing subtle phrasing and context in patent abstracts.
We decided to use the BERT sentence approach for classifying all QA pairs, ensuring more nuanced and reliable coverage of the different cleantech question types.
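One simple way to quantify how much two classifiers disagree is the share of questions they place in the same category (a sketch; it assumes the two categorized DataFrames are row-aligned on the same questions, as they are here since both start from the same QA frames):

```python
import pandas as pd

def agreement_rate(df_a, df_b, col="category"):
    """Fraction of rows where two classifiers assign the same category."""
    a = df_a[col].reset_index(drop=True)
    b = df_b[col].reset_index(drop=True)
    return float((a == b).mean())
```

For example, `agreement_rate(df_patent_categorized_sentence_bert, df_patent_categorized_patent_sentence_bert)` would show how often PatentSBERTa overrides the all-MiniLM-L6-v2 assignment.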
categories_rule_based = df_patent_categorized_rule_based['category'].value_counts()
barplot = sns.barplot(
x=categories_rule_based.values,
y=categories_rule_based.index,
hue=categories_rule_based.index
)
barplot.set_title('Question Count by Category (Rule-Based, Patent QA)')
barplot.set_xlabel('Question Count')
barplot.set_ylabel('Category')
plt.show()
categories_sentence_bert = df_media_categorized_sentence_bert['category'].value_counts()
barplot = sns.barplot(
x=categories_sentence_bert.values,
y=categories_sentence_bert.index,
hue=categories_sentence_bert.index
)
barplot.set_title('Question Count by Category (all-MiniLM-L6-v2, Media QA)')
barplot.set_xlabel('Question Count')
barplot.set_ylabel('Category')
plt.show()
categories_patent_bert = df_media_categorized_patent_sentence_bert['category'].value_counts()
barplot = sns.barplot(
x=categories_patent_bert.values,
y=categories_patent_bert.index,
hue=categories_patent_bert.index
)
barplot.set_title('Question Count by Category (PatentSBERTa, Media QA)')
barplot.set_xlabel('Question Count')
barplot.set_ylabel('Category')
plt.show()
The QA pairs are divided into these question categories:¶
Factual Questions
Comparative & Market Analysis Questions
Analytical & Explanatory Questions
Government & Corporate Initiatives
Sustainability & Technological Innovation Questions
Factual Questions¶
Factual Questions: Ask for concrete details or definitions (e.g., “What is…”).
factual_df = df_media_categorized_sentence_bert[df_media_categorized_sentence_bert['category'] == 'Factual Questions']
factual_questions = factual_df['question'].tolist()
show_questions("Factual Questions for the Media Dataset", factual_questions)
Factual Questions for the Media Dataset --------------------------------------- 1. How long will your questions, but not your email details, be shared with OpenAI? 2. What is the Hudson Bay located in? 3. How many polar bears are estimated to reside in the Hudson Bay? 4. Why is sea ice important for polar bears in the Hudson Bay? 5. What is the current ice-free period in the Hudson Bay? 6. What is the maximum fasting time that adult polar bears can tolerate?
plot_character_distribution(factual_df)
plot_word_distribution(factual_df)
plot_gpt4_token_distribution(factual_df)
plot_non_stopword_token_distribution(factual_df)
factual_df = df_patent_categorized_sentence_bert[df_patent_categorized_sentence_bert['category'] == 'Factual Questions']
factual_questions = factual_df['question'].tolist()
show_questions("Factual Questions for the Patent Dataset", factual_questions)
plot_character_distribution(factual_df)
plot_word_distribution(factual_df)
plot_gpt4_token_distribution(factual_df)
plot_non_stopword_token_distribution(factual_df)
Factual Questions for the Patent Dataset ----------------- 1. What is the purpose of the tower heliostat? 2. What is the role of the tower type heliostat in the system? 3. What kind of data do we need to acquire to balance electric quantity in a power system containing large-scale renewable energy sources? 4. What system does the device utilize for control and temperature setting? 5. How does the system get power supply? 6. What is the key component of the cable core that ensures symmetry between the main wire core and the ground wire core?
Comparative & Market Analysis Questions¶
Involve comparing cleantech technologies or approaches (e.g., “What are the differences between…”).
Comparative_Market_Analysis_df = df_media_categorized_sentence_bert[df_media_categorized_sentence_bert['category'] == 'Comparative & Market Analysis Questions']
Comparative_Market_Analysis_Questions = Comparative_Market_Analysis_df['question'].tolist()
show_questions("Comparative & Market Analysis Questions for the Media Dataset", Comparative_Market_Analysis_Questions)
Comparative & Market Analysis Questions for the Media Dataset ------------------------------------------------------------- 1. What is the estimated total sales volume target for Asahi Kasei's hydrogen business by 2030? 2. How has the Hudson Bay area warmed over the last three decades? 3. Who will lead GRNE after the acquisition? 4. How does GRNE plan to continue growing after the acquisition? 5. Who has expressed excitement about the acquisition? 6. Where are large solar markets more prone to extreme weather-related damage?
plot_character_distribution(Comparative_Market_Analysis_df)
plot_word_distribution(Comparative_Market_Analysis_df)
plot_gpt4_token_distribution(Comparative_Market_Analysis_df)
plot_non_stopword_token_distribution(Comparative_Market_Analysis_df)
Comparative_Market_Analysis_df = df_patent_categorized_sentence_bert[df_patent_categorized_sentence_bert['category'] == 'Comparative & Market Analysis Questions']
Comparative_Market_Analysis_Questions = Comparative_Market_Analysis_df['question'].tolist()
show_questions("Comparative & Market Analysis Questions for the Patent Dataset", Comparative_Market_Analysis_Questions)
plot_character_distribution(Comparative_Market_Analysis_df)
plot_word_distribution(Comparative_Market_Analysis_df)
plot_gpt4_token_distribution(Comparative_Market_Analysis_df)
plot_non_stopword_token_distribution(Comparative_Market_Analysis_df)
Analytical & Explanatory Questions¶
Require reasoning or explanation (e.g., “How does…”).
Analytical_Explanatory_Questions_df = df_media_categorized_sentence_bert[df_media_categorized_sentence_bert['category'] == 'Analytical & Explanatory Questions']
Analytical_Explanatory_Questions = Analytical_Explanatory_Questions_df['question'].tolist()
show_questions("Analytical & Explanatory Questions for the Media Dataset", Analytical_Explanatory_Questions)
Analytical & Explanatory Questions for the Media Dataset -------------------------------------------------------- 1. What is the need for robust safety protocols and storage solutions for hydrogen due to its highly flammable nature? 2. Where did Julienne Stroeve and colleagues estimate projected changes to the ice-free time? 3. What technique is used to separate batteries' cathodes and anodes? 4. What is the role of the EverBatt model in the research project? 5. What type of testing is being done by researchers at MERF? 6. What is the limitation of the current study, according to the researchers?
plot_character_distribution(Analytical_Explanatory_Questions_df)
plot_word_distribution(Analytical_Explanatory_Questions_df)
plot_gpt4_token_distribution(Analytical_Explanatory_Questions_df)
plot_non_stopword_token_distribution(Analytical_Explanatory_Questions_df)
Analytical_Explanatory_Questions_df = df_patent_categorized_sentence_bert[df_patent_categorized_sentence_bert['category'] == 'Analytical & Explanatory Questions']
Analytical_Explanatory_Questions = Analytical_Explanatory_Questions_df['question'].tolist()
show_questions("Analytical & Explanatory Questions for the Patent Dataset", Analytical_Explanatory_Questions)
plot_character_distribution(Analytical_Explanatory_Questions_df)
plot_word_distribution(Analytical_Explanatory_Questions_df)
plot_gpt4_token_distribution(Analytical_Explanatory_Questions_df)
plot_non_stopword_token_distribution(Analytical_Explanatory_Questions_df)
Analytical & Explanatory Questions for the Patent Dataset --------------------------------------------------------- 1. What is the purpose of the cold accumulator evaporator? 2. What is the purpose of the utility model? 3. What makes the structure easy to handle? 4. What is the main challenge that the utility model solves? 5. What is good about the utility model? 6. What is facilitated by the utility model?
Government & Corporate Initiatives¶
Government_Corporate_Initiatives_df = df_media_categorized_sentence_bert[df_media_categorized_sentence_bert['category'] == 'Government & Corporate Initiatives']
Government_Corporate_Initiatives = Government_Corporate_Initiatives_df['question'].tolist()
show_questions("Government & Corporate Initiatives for the Media Dataset", Government_Corporate_Initiatives)
Government & Corporate Initiatives for the Media Dataset -------------------------------------------------------- 1. How does the Hydrogen Council facilitate partnerships and collaboration among companies, governments, and academia? 2. What is the role of international cooperation in promoting green hydrogen adoption and harmonizing regulations? 3. When did the company start operation on its commercial-scale facility in Kawasaki, Japan? 4. Who is the Lead Executive Officer of Asahi Kasei responsible for the company's hydrogen business? 5. What is the name of the company discussed in the interview? 6. Who is behind the collaboration between the US Department of Energy's Argonne National Laboratory and Toyota Motor North America?
plot_character_distribution(Government_Corporate_Initiatives_df)
plot_word_distribution(Government_Corporate_Initiatives_df)
plot_gpt4_token_distribution(Government_Corporate_Initiatives_df)
plot_non_stopword_token_distribution(Government_Corporate_Initiatives_df)
Government_Corporate_Initiatives_df = df_patent_categorized_sentence_bert[df_patent_categorized_sentence_bert['category'] == 'Government & Corporate Initiatives']
Government_Corporate_Initiatives = Government_Corporate_Initiatives_df['question'].tolist()
show_questions("Government & Corporate Initiatives for the Patent Dataset", Government_Corporate_Initiatives)
plot_character_distribution(Government_Corporate_Initiatives_df)
plot_word_distribution(Government_Corporate_Initiatives_df)
plot_gpt4_token_distribution(Government_Corporate_Initiatives_df)
plot_non_stopword_token_distribution(Government_Corporate_Initiatives_df)
Government & Corporate Initiatives for the Patent Dataset --------------------------------------------------------- 1. What is the role of the refrigeration house condenser? 2. What are the main components of the cable core? 3. What type of insulation is used on the main wire core? 4. What are the three key components of the cable core in the insulation low-smoke halogen-free torsion-resistant medium-voltage wind energy cable? 5. How is the main wire core formed during the manufacturing process? 6. What is the arrangement of the main wire core, ground wire core, ethylene propylene fillings, and semi-conductive filling in the cable core?
Sustainability & Technological Innovation Questions¶
Sustainability_Technological_Innovation_Questions_df = df_media_categorized_sentence_bert[df_media_categorized_sentence_bert['category'] == 'Sustainability & Technological Innovation Questions']
Sustainability_Technological_Innovation_Questions = Sustainability_Technological_Innovation_Questions_df['question'].tolist()
show_questions("Sustainability & Technological Innovation Questions for the Media Dataset", Sustainability_Technological_Innovation_Questions)
Sustainability & Technological Innovation Questions for the Media Dataset ------------------------------------------------------------------------- 1. What is green hydrogen, and how is it produced? 2. Who are some of the key players involved in the development and adoption of green hydrogen? 3. What are the main challenges associated with integrating green hydrogen into the energy grid? 4. What is the role of the European Union's Hydrogen Strategy in promoting green hydrogen adoption? 5. What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hydrogen integration? 6. What is the goal of the HyGreen Provence Project in France, and how does it contribute to regional energy transition efforts?
plot_character_distribution(Sustainability_Technological_Innovation_Questions_df)
plot_word_distribution(Sustainability_Technological_Innovation_Questions_df)
plot_gpt4_token_distribution(Sustainability_Technological_Innovation_Questions_df)
plot_non_stopword_token_distribution(Sustainability_Technological_Innovation_Questions_df)
Sustainability_Technological_Innovation_Questions_df = df_patent_categorized_sentence_bert[df_patent_categorized_sentence_bert['category'] == 'Sustainability & Technological Innovation Questions']
Sustainability_Technological_Innovation_Questions = Sustainability_Technological_Innovation_Questions_df['question'].tolist()
show_questions("Sustainability & Technological Innovation Questions for the Patent Dataset", Sustainability_Technological_Innovation_Questions)
plot_character_distribution(Sustainability_Technological_Innovation_Questions_df)
plot_word_distribution(Sustainability_Technological_Innovation_Questions_df)
plot_gpt4_token_distribution(Sustainability_Technological_Innovation_Questions_df)
plot_non_stopword_token_distribution(Sustainability_Technological_Innovation_Questions_df)
Sustainability & Technological Innovation Questions for the Patent Dataset -------------------------------------------------------------------------- 1. What is the overall function of the distributed photovoltaic energy storage refrigeration house system? 2. What is the purpose of the distributed photovoltaic energy storage refrigeration house system? 3. What is the main function of the cold accumulation equipment? 4. How does the system reduce the cost of cold storage? 5. What is unique about the mounting panel? 6. What is the purpose of the second isolation layer?
5. Set up critique agents¶
The questions generated by our LLM may exhibit various flaws. For instance, some questions might lack sufficient context to be answered accurately, and others may depend too much on the original document context, making them unclear when viewed in isolation.
To address this, we build critique agents, automated evaluators that rate each QA pair on three key criteria:
Groundedness: Can the question be answered unambiguously given the provided context?
Relevance: Is the question useful to machine learning developers building NLP applications? (For example, a question like “What is the date when transformers 4.29.1 was released?” is not very relevant.)
Standalone Clarity: Does the question make sense on its own without requiring additional context? (A question that refers implicitly to a specific setting without clarification would score poorly here.)
These criteria are inspired by the evaluation methods described in this paper. To obtain a reliable judgment, we prompt the LLM with instructions that ask for not only a numerical score (on a scale from 1 to 5) but also a detailed rationale for that score. Providing the rationale helps verify the scoring process and gives the model additional tokens to “think” about the answer before summarizing it.
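Because the LLM's reply is free text (rationale plus score), the score has to be parsed back out before filtering. A minimal sketch, assuming the prompt asks the model to end its reply with a line such as `Total rating: 4` (a hypothetical format; adapt the regex to whatever your prompt actually requests):

```python
import re

def parse_rating(reply: str):
    """Extract a 1-5 rating from a critique agent's reply, or None if the
    model failed to produce one (so the pair can be re-judged or dropped)."""
    match = re.search(r"rating:\s*([1-5])", reply, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None
```

Returning `None` instead of raising keeps a single malformed reply from aborting a long evaluation run.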
Below is the code that sets up the critique agents and applies them to our generated QA pairs.
patent_qa_df_llama = pd.read_csv("datasets/QA_pairs_categorized_patent_qa_df_llama.csv")
media_qa_df_llama = pd.read_csv("datasets/QA_pairs_categorized_media_qa_df_llama.csv")
media_qa_deepseek = pd.read_csv("datasets/QA_pairs_categorized_media_qa_df_deepseek.csv")
patent_qa_deepseek = pd.read_csv("datasets/QA_pairs_categorized_paten_qa_df_deepseek.csv")
Converting QA Pairs into LangChain Documents & Standardized Outputs¶
To plug our QA pairs into LangChain and downstream RAG components, we define three helpers:
generate_langchain_document_list(df, name=None)
Transforms each QA row into a LangchainDocument with page_content set to the QA's relevant_text and metadata containing the question, answer, category, title, and source URL.
generate_output_list(df, name=None)
Creates plain dictionaries for each QA pair, renaming keys to match RAG API expectations:
- For media: maps relevant_text → context and article_url → source_doc, plus question, answer, and category.
- For patents: omits source_doc and retains context, question, answer, and category.
remove_duplicates(outputs)
Ensures each (context, question) pair is unique by filtering out any repeated combinations, preserving only the first occurrence.
# convert the df to langchain document
def generate_langchain_document_list(df, name=None):
    """
    Generates a list of LangchainDocument objects from a DataFrame.
    Parameters:
    df (pd.DataFrame): DataFrame containing the data.
    name (str, optional): Unused; kept for symmetry with generate_output_list.
    Returns:
    list: A list of LangchainDocument objects.
    """
    records = df.to_dict("records")
    documents = [
        LangchainDocument(
            page_content=row["relevant_text"],
            metadata={
                "question": row["question"],
                "answer": row["answer"],
                "category": row["category"],
                # .get() avoids KeyErrors: patent rows have no article_url
                # column, and media rows have no title column
                "title": row.get("title"),
                "url": row.get("article_url"),
            },
        )
        for row in records
    ]
    return documents
def generate_output_list(df, name=None):
records = df.to_dict("records")
if name == "patent":
key_mapping = {
"relevant_text": "context",
"question": "question",
"answer": "answer",
"category": "category"
}
else:
key_mapping = {
"relevant_text": "context",
"question": "question",
"answer": "answer",
"article_url": "source_doc",
"category": "category"
}
output_list = [
{ new_key: row[old_key] for old_key, new_key in key_mapping.items() }
for row in records
]
return output_list
def remove_duplicates(outputs):
    """Keep only the first occurrence of each (context, question) pair."""
    seen_content = set()
    unique_outputs = []
    for doc in outputs:
        content_question_pair = (doc['context'], doc['question'])
        if content_question_pair not in seen_content:
            seen_content.add(content_question_pair)
            unique_outputs.append(doc)
    return unique_outputs

# Alias so later cells that call the original misspelled name still run.
remove_dublicates = remove_duplicates
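A quick sanity check of the dedup behaviour on toy data (the rows are hypothetical, and the helper is re-stated so this cell runs on its own):

```python
def remove_duplicates(outputs):
    # Keep only the first occurrence of each (context, question) pair.
    seen = set()
    unique = []
    for doc in outputs:
        key = (doc["context"], doc["question"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

toy = [
    {"context": "c1", "question": "q1", "answer": "first"},
    {"context": "c1", "question": "q1", "answer": "second"},  # duplicate pair
    {"context": "c1", "question": "q2", "answer": "third"},
]
cleaned = remove_duplicates(toy)
print(len(cleaned))  # 2: the duplicate (c1, q1) row is dropped, first kept
```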
For the two Llama datasets, convert the output to a LangChain-compatible format¶
For the Media Dataset¶
media_qa_llama = [
media_qa_df_llama
]
media_qa_df_llama = stack_and_categorize(media_qa_llama)
media_qa_df_llama.to_csv('data/QA_pairs_categorized_media_qa_df_llama.csv')
output_media_llama = generate_output_list(media_qa_df_llama,name= 'media')
len(output_media_llama)
746
output_media_cleaned_llama = remove_dublicates(output_media_llama)
print(f'The media dataset had {len(output_media_llama)} QA pairs before deduplication and {len(output_media_cleaned_llama)} after.')
The media dataset had 746 QA pairs before deduplication and 746 after.
So the media QA dataset contained no duplicates.
For the Patent Dataset¶
patent_qa_llama = [
patent_qa_df_llama
]
patent_qa_df_llama = stack_and_categorize(patent_qa_llama)
patent_qa_df_llama.to_csv('data/QA_pairs_categorized_patent_qa_df_llama.csv')
output_patent_llama = generate_output_list(patent_qa_df_llama,name= 'patent')
output_patent_cleaned_llama = remove_dublicates(output_patent_llama)
print(f'The patent dataset had {len(output_patent_llama)} QA pairs before deduplication and {len(output_patent_cleaned_llama)} after.')
The patent dataset had 523 QA pairs before deduplication and 493 after.
For the two DeepSeek datasets, convert the output to a LangChain-compatible format¶
For the Media Dataset¶
media_qa_deepseek = [
media_qa_deepseek
]
media_qa_df_deepseek = stack_and_categorize(media_qa_deepseek)
media_qa_df_deepseek.to_csv('data/QA_pairs_categorized_media_qa_df_deepseek.csv')
output_media_deepseek = generate_output_list(media_qa_df_deepseek,name= 'media')
output_media_cleaned_deepseek = remove_dublicates(output_media_deepseek)
print(f'The media dataset had {len(output_media_deepseek)} QA pairs before deduplication and {len(output_media_cleaned_deepseek)} after.')
The media dataset had 50 QA pairs before deduplication and 50 after.
For the Patent Dataset¶
patent_qa_deepseek = [
patent_qa_deepseek
]
patent_qa_df_deepseek = stack_and_categorize(patent_qa_deepseek)
patent_qa_df_deepseek.to_csv('data/QA_pairs_categorized_patent_qa_df_deepseek.csv')
output_patent_deepseek = generate_output_list(patent_qa_df_deepseek,name= 'patent')
output_patent_cleaned_deepseek = remove_dublicates(output_patent_deepseek)
print(f'The patent dataset had {len(output_patent_deepseek)} QA pairs before deduplication and {len(output_patent_cleaned_deepseek)} after.')
The patent dataset had 147 QA pairs before deduplication and 138 after.
Critique Prompts for Evaluation¶
question_groundedness_critique_prompt = """
You will be given a context and a question.
Your task is to provide a 'total rating' scoring how well one can answer the given question unambiguously with the given context.
Give your answer on a scale of 1 to 5, where 1 means that the question is not answerable at all given the context, and 5 means that the question is clearly and unambiguously answerable with the context.
Provide your answer as follows:
Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and context.
Question: {question}\n
Context: {context}\n
Answer::: """
question_relevance_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how useful this question can be to machine learning developers building NLP applications with the Hugging Face ecosystem.
Give your answer on a scale of 1 to 5, where 1 means that the question is not useful at all, and 5 means that the question is extremely useful.
Provide your answer as follows:
Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here is the question.
Question: {question}\n
Answer::: """
question_standalone_critique_prompt = """
You will be given a question.
Your task is to provide a 'total rating' representing how context-independent this question is.
Give your answer on a scale of 1 to 5, where 1 means that the question depends on additional information to be understood, and 5 means that the question makes sense by itself.
For instance, if the question refers to a particular setting, like 'in the context' or 'in the document', the rating must be 1.
The questions can contain obscure technical nouns or acronyms like Gradio, Hub, Hugging Face or Space and still be a 5: it must simply be clear to an operator with access to documentation what the question is about.
For instance, "What is the name of the checkpoint from which the ViT model is imported?" should receive a 1, since there is an implicit mention of a context, thus the question is not independent from the context.
Provide your answer as follows:
Answer:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 5)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here is the question.
Question: {question}\n
Answer::: """
Automating LLM Based QA Evaluation¶
This section wires up our Llama "judge" prompts to the OpenAI client, parses the responses, and assembles a clean evaluation DataFrame.
We'll use a compiled regular expression to pull out the free-text rationale (Evaluation:) and the numeric score (Total rating:) from each LLM response.
import re
pattern = re.compile(
r"Evaluation:\s*(.*?)\s*Total rating:\s*(\d+)",
re.DOTALL,
)
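A quick check of the pattern on a canned judge reply (the reply text is made up for illustration; the pattern is re-stated so the snippet runs on its own):

```python
import re

# Same pattern as above, re-stated for a self-contained demo.
pattern = re.compile(r"Evaluation:\s*(.*?)\s*Total rating:\s*(\d+)", re.DOTALL)

sample_reply = (
    "Answer:::\n"
    "Evaluation: The question is clearly answerable from the context.\n"
    "Total rating: 4"
)

match = pattern.search(sample_reply)
rationale = match.group(1).strip()
score = int(match.group(2))
print(score)      # 4
print(rationale)  # the free-text rationale between the two markers
```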
from openai import OpenAI
client = OpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.getenv("api_key"),
)
# Sends a single prompt and returns the raw text reply.
def call_llm(client: OpenAI, prompt: str):
completion = client.chat.completions.create(
extra_body={},
model="meta-llama/llama-3.2-3b-instruct",
messages=[
{
"role": "user",
"content": prompt
}
]
)
if not completion.choices:
print(completion)
return None
return completion.choices[0].message.content
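Since call_llm returns None when the API sends back no choices, a thin retry wrapper can smooth over transient failures. This is a sketch, not part of the original pipeline; the flaky stub below just simulates a call that fails once before succeeding:

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    """Call fn() up to `attempts` times, returning the first non-None result."""
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay)
    return None

# Hypothetical stub: fails on the first call, succeeds on the second.
state = {"calls": 0}
def flaky_stub():
    state["calls"] += 1
    return None if state["calls"] < 2 else "Evaluation: ok\nTotal rating: 5"

reply = with_retries(flaky_stub, attempts=3, delay=0.0)
print(reply is not None)  # True: the second attempt returned a reply
```

In practice each judge call could be wrapped as `with_retries(lambda: call_llm(client, prompt))`.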
def create_eval(outputs, name=None):
tests = []
for i in range(0, len(outputs), 100):
for output in tqdm(outputs[i:i+100]):
if name and name.lower() == "patent":
eval_result = {
"context": output["context"],
"question": output["question"],
"answer": output["answer"],
"category": output["category"],
}
else:
eval_result = {
"context": output["context"],
"question": output["question"],
"answer": output["answer"],
"source_doc": output["source_doc"],
"category": output["category"],
}
            try:
                evaluations = {
                    "groundedness": call_llm(
                        client,
                        question_groundedness_critique_prompt.format(
                            context=output["context"], question=output["question"]
                        ),
                    ),
                    "relevance": call_llm(
                        client,
                        question_relevance_critique_prompt.format(question=output["question"]),
                    ),
                    "standalone": call_llm(
                        client,
                        question_standalone_critique_prompt.format(question=output["question"]),
                    ),
                }
                for criterion, evaluation in evaluations.items():
                    # call_llm can return None on an empty API response
                    if evaluation is None:
                        print(f"{criterion} evaluation returned no response")
                        continue
                    match = pattern.search(evaluation)
                    if match:
                        eval_result[f"{criterion}_score"] = int(match.group(2).strip())
                        eval_result[f"{criterion}_eval"] = match.group(1).strip()
                    else:
                        print(f"{criterion} evaluation did not match:")
                        print(evaluation)
            except Exception as e:
                # Skip this QA pair if any API call raises
                print(f"Skipping pair due to error: {e}")
                continue
            tests.append(eval_result)
qa_cleaned = pd.DataFrame.from_dict(tests)
if name:
qa_cleaned.to_csv(f"{name}.csv")
return qa_cleaned
6. Filtering & Quality Control¶
After running our QA pairs through the LLM-based critique agents, we now have scores for each question on three key criteria: groundedness, relevance, and standalone clarity. In this step, we filter out the QA pairs that score below our threshold on any of these criteria. The following code displays the evaluation metrics before filtering and then retains only those questions that score at least 3 on each criterion.
In our evaluation, we began with a total of 694 QA pairs. After applying our filtering criteria (removing any pair that scored below 3 in groundedness, relevance, or standalone clarity), only 86 QA pairs remained. This significant reduction indicates that many generated QA pairs did not meet our quality standards. Such a high attrition rate may be due to limitations of the generation process, where some questions might be ambiguous, not contextually grounded, or overly dependent on the source text. It could also reflect the sensitivity of our evaluation prompts and the complexity of the cleantech domain, particularly in technical patent contexts.
Although retaining only 86 out of 694 QA pairs might seem drastic, this rigorous filtering ensures that our final dataset contains only high-quality, reliable questions. By using only the top-performing QA pairs, we significantly reduce noise in our dataset, thereby establishing a robust foundation for subsequent stages of our RAG system. Ultimately, while the filtering process discards a large portion of the initially generated pairs, it enhances the overall quality, making our downstream tasks more efficient and effective.
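For reference, the retention rate implied by these counts is easy to compute:

```python
total_pairs = 694  # QA pairs before filtering
kept_pairs = 86    # QA pairs scoring >= 3 on all three criteria
retention = kept_pairs / total_pairs
print(f"{retention:.1%} of generated QA pairs survived filtering")
```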
Evaluation for the Llama3.2 generated QAs¶
The eval_qa_pairs function lets you inspect the full set of generated QA pairs, apply quality thresholds, and then view the cleaned subset:
Display Before Filtering
- Prints "Evaluation dataset before filtering:" and shows each row's question and answer, plus the three numeric scores (groundedness_score, relevance_score, standalone_score) and their corresponding LLM feedback.
Apply Thresholds
- Keeps only those QA pairs where all scores are ≥ 3 (on a 1–5 scale) for groundedness, relevance, and standalone clarity.
Display After Filtering
- Prints a divider, then "Final evaluation dataset:", and displays just the question, answer, and the three retained scores for the filtered set.
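The thresholding step reduces to a boolean mask over the three score columns. A minimal sketch on hypothetical scores:

```python
import pandas as pd

# Hypothetical evaluated QA pairs.
df = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "answer": ["a1", "a2", "a3"],
    "groundedness_score": [5, 2, 4],
    "relevance_score": [3, 5, 4],
    "standalone_score": [4, 4, 2],
})

# Keep rows scoring >= 3 on every criterion.
mask = (
    (df["groundedness_score"] >= 3)
    & (df["relevance_score"] >= 3)
    & (df["standalone_score"] >= 3)
)
filtered = df[mask]
print(list(filtered["question"]))  # ['q1']: only q1 passes all thresholds
```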
def eval_qa_pairs(generated_questions: pd.DataFrame, n: int = 5) -> pd.DataFrame:
"""
Displays only the head (first n rows) of the QA evaluation
before and after filtering.
Parameters:
generated_questions (pd.DataFrame): DataFrame with QA pairs and eval scores.
n (int): Number of rows to display in each table (default 5).
Returns:
pd.DataFrame: The filtered DataFrame (all rows).
"""
cols_before = [
"question", "answer",
"groundedness_score", "groundedness_eval",
"relevance_score", "relevance_eval",
"standalone_score", "standalone_eval",
]
cols_after = [
"question", "answer",
"groundedness_score",
"relevance_score",
"standalone_score",
]
print(f"Evaluation dataset before filtering (showing first {n} rows):")
display(generated_questions[cols_before].head(n))
filtered = generated_questions.loc[
(generated_questions["groundedness_score"] >= 3) &
(generated_questions["relevance_score"] >= 3) &
(generated_questions["standalone_score"] >= 3)
]
print("============================================")
print(f"Final evaluation dataset (showing first {n} rows):")
display(filtered[cols_after].head(n))
return filtered
Evaluate the Media Dataset¶
Evaluate the media QA pairs generated with Llama 3.2.
generated_questions_cleaned_media_llama = create_eval(output_media_cleaned_llama,"media_dataset")
Instead of regenerating the evaluation for the QA pairs, we have already run the evaluation; the results can be seen in the datasets below.
generated_questions_cleaned_media_llama = pd.read_csv("datasets/generated_questions_cleaned_media_llama.csv")
len(generated_questions_cleaned_media_llama)
97
generated_questions_cleaned_media_llama = eval_qa_pairs(generated_questions_cleaned_media_llama)
Evaluation dataset before filtering (showing first 5 rows):
| question | answer | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | |
|---|---|---|---|---|---|---|---|---|
| 0 | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | 5.0 | The provided context clearly explains what green hydrogen is, how it is produced through water e... | 3.0 | The question is straightforward and clear, inquiring about a specific concept (green hydrogen) a... | 5.0 | The question is self-explanatory and does not rely on additional context to be understood. It is... |
| 1 | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | 5.0 | The context provides a comprehensive overview of the significance of green hydrogen and its inte... | 4.0 | The Neom project is a key concept in the context of green hydrogen integration, and understandin... | 4.0 | The question appears to rely on additional knowledge about the Neom project and its connection t... |
| 2 | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | 5.0 | The question regarding the significance of the Energy Transitions Commission's report "Making Cl... | 3.0 | This question appears to be relevant to environmental sustainability and energy policy, which mi... | 5.0 | The question refers to a specific institution (Energy Transitions Commission) and a specific con... |
| 3 | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | 4.0 | The question requires an in-depth analysis of the context provided, specifically focusing on the... | 4.0 | This question is relevant to NLP developers building applications that may use energy-intensive ... | 5.0 | The question implies that there might be some universal or general information about green hydro... |
| 4 | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | 5.0 | The question can be answered unambiguously based on the provided context, which describes variou... | 3.0 | This question appears to be focused on environmental sustainability and energy systems, which is... | 4.0 | The question asks for specific examples of pilot projects, which implies the existence of a cont... |
============================================ Final evaluation dataset (showing first 5 rows):
| question | answer | groundedness_score | relevance_score | standalone_score | |
|---|---|---|---|---|---|
| 0 | What is green hydrogen, and how is it produced? | Green hydrogen is a sustainable energy carrier produced by water electrolysis using renewable en... | 5.0 | 3.0 | 5.0 |
| 1 | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... | The Neom project, a partnership between ACWA Power, Air Products, and NEOM, harnesses solar and ... | 5.0 | 4.0 | 4.0 |
| 2 | What is the significance of the Energy Transitions Commission's report on making clean electrifi... | The report highlights the need for a 30-year transition to electrify the global economy, providi... | 5.0 | 3.0 | 5.0 |
| 3 | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? | Green hydrogen production through electrolysis is less energy-efficient than direct use of elect... | 4.0 | 4.0 | 5.0 |
| 4 | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... | Various pilot projects are testing green hydrogen's viability in energy systems, demonstrating i... | 5.0 | 3.0 | 4.0 |
generated_questions_cleaned_media_llama.to_csv("data/generated_questions_cleaned_media_llama.csv")
Now let's have a look at the evaluated questions.
generated_questions_cleaned_media_llama['question'].head()
| question | |
|---|---|
| 0 | What is green hydrogen, and how is it produced? |
| 1 | What is the significance of the Neom project in Saudi Arabia as a pioneering example of green hy... |
| 2 | What is the significance of the Energy Transitions Commission's report on making clean electrifi... |
| 3 | How does green hydrogen compare to direct use of electricity in terms of energy efficiency? |
| 4 | What are some of the examples of pilot projects testing the viability of green hydrogen in vario... |
Evaluate the Patent Dataset¶
Evaluate the patent QA pairs generated with Llama 3.2.
generated_questions_cleaned_patent_llama = create_eval(output_patent_cleaned_llama,"patent")
generated_questions_cleaned_patent_llama = pd.read_csv("datasets/generated_questions_cleaned_patent_llama.csv")
generated_questions_cleaned_patent_llama = eval_qa_pairs(generated_questions_cleaned_patent_llama)
Evaluation dataset before filtering (showing first 5 rows):
| question | answer | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | |
|---|---|---|---|---|---|---|---|---|
| 0 | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | 5.0 | The question can be answered based on the given context, and the answer is clear and unambiguous. | 4.0 | The question is concise and to the point, directly asking about a specific aspect of how the Hug... | 5.0 | The question appears to be related to a general concept of cost reduction in the context of data... |
| 1 | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | 5.0 | The question "What is good about the utility model?" is somewhat ambiguous without further clari... | 3.0 | The question is very short and to the point, but it lacks context and detail about the specific ... | 5.0 | The question seems to be asking about a general property or characteristic of a "utility model",... |
| 2 | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | 5.0 | The question is clearly answerable by understanding the purpose of using power supply with bette... | 3.0 | The question is directly related to power supply regulation and its impact on the performance of... | 5.0 | The question assumes knowledge of power supplies in general, specifically their regulation perfo... |
| 3 | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | 5.0 | The question is clearly answerable with the given context, as it describes a specific inner dive... | 3.0 | The question seems to be about a specific technical term, which may be useful for machine learni... | 5.0 | The question contains technical terms and a specific reference to a concept that appears to be w... |
| 4 | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | 4.0 | The context provides a detailed description of a method for balancing electric quantity in a pow... | 4.0 | The question is asking about a specific aspect of the balancing method, which is a common techni... | 5.0 | The question does not provide a specific context, and the balancing method is a general concept ... |
============================================ Final evaluation dataset (showing first 5 rows):
| question | answer | groundedness_score | relevance_score | standalone_score | |
|---|---|---|---|---|---|
| 0 | How does the system reduce the cost of cold storage? | The system reduces the cost of cold storage by converting solar energy into electric energy, whi... | 5.0 | 4.0 | 5.0 |
| 1 | What is good about the utility model? | The utility model has a simple structure that is easy to assemble and disassemble. | 5.0 | 3.0 | 5.0 |
| 2 | What is the purpose of using power supply with better regulation performance to compensate for p... | The power supply with better regulation performance is used to carry out compensation regulation... | 5.0 | 3.0 | 5.0 |
| 3 | What is the water conservancy and hydropower engineering construction hole inner diversion drain... | The utility model discloses a water conservancy and hydropower engineering construction hole inn... | 5.0 | 3.0 | 5.0 |
| 4 | What is the main consideration for the balancing method, in addition to safety and economy? | The seasonal characteristics of renewable energy sources in time and the coordination problem of... | 4.0 | 4.0 | 5.0 |
len(generated_questions_cleaned_patent_llama)
32
generated_questions_cleaned_patent_llama['question'].head()
| question | |
|---|---|
| 0 | How does the system reduce the cost of cold storage? |
| 1 | What is good about the utility model? |
| 2 | What is the purpose of using power supply with better regulation performance to compensate for p... |
| 3 | What is the water conservancy and hydropower engineering construction hole inner diversion drain... |
| 4 | What is the main consideration for the balancing method, in addition to safety and economy? |
generated_questions_cleaned_patent_llama.to_csv("data/generated_questions_cleaned_patent_llama.csv")
From the set of five questions below, we observe several key characteristics that demonstrate the quality and diversity of the generated QA pairs. Overall, these examples illustrate that our process successfully produced questions that are contextually grounded, relevant, and thought-provoking:
The questions address both technical and real-world concerns: for example, one question explores the potential for recycling electric car batteries and its broader environmental and economic benefits, while another discusses the impact of global warming on wildlife. This shows that our QA pairs capture a range of topics within the cleantech domain.
They integrate multiple aspects of the domain. Questions connect technological aspects (like recycling challenges and grid stability with V2X technology) with behavioral or market trends (such as the influence of owning an electric vehicle on solar panel installations). This blend of technical and contextual elements makes the QA pairs more comprehensive.
The questions are formulated to invite detailed and informative answers. By asking about challenges and potential solutions in one case, or the broader implications of a technology in another, these questions encourage responses that are both analytical and forward-looking.
In general, the generated questions are clear and self-contained, meaning they can stand alone without needing extensive external context. This quality is crucial for ensuring that the QA pairs are useful in downstream tasks, like feeding into a retrieval-augmented generation system.
In summary, the diversity and clarity of these questions, ranging from technical specifics to broader market and environmental implications, demonstrate that our QA pair generation method is effective at producing high-quality, varied, and contextually rich questions across different aspects of cleantech innovation.
These questions collectively show that our approach results in a robust QA dataset that captures various dimensions of the cleantech field.
Evaluation for the Deepseek generated QAs¶
For the Media Dataset¶
Evaluate the generated QA pairs for the media QA pairs generated with DeepSeek R1.
generated_questions_cleaned_media_deepseek = pd.read_csv("datasets/generated_questions_cleaned_media_deepseek.csv")
generated_questions_cleaned_media_deepseek = create_eval(output_media_cleaned_deepseek,"media_dataset")
generated_questions_cleaned_media_deepseek = eval_qa_pairs(generated_questions_cleaned_media_deepseek)
Evaluation dataset before filtering (showing first 5 rows):
| question | answer | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | |
|---|---|---|---|---|---|---|---|---|
| 0 | ** What is green hydrogen, and how is it produced? \n** | ** Green hydrogen is a sustainable energy carrier produced through water electrolysis powered by... | 5.0 | The provided context thoroughly explains what green hydrogen is and how it is produced through w... | 3 | This question touches on a specific topic in the context of sustainability and renewable energy,... | 4.0 | This question appears to be self-contained and refers to a general concept, but it would benefit... |
| 1 | Why are polar bears in Hudson Bay particularly vulnerable to longer ice-free periods? | Polar bears rely on sea ice to hunt seals, their primary food source. Longer ice-free periods fo... | 5.0 | The question is clearly answerable as it is directly tied to the specific location and the impac... | 4 | This question is useful for machine learning developers building NLP applications because it req... | 5.0 | The question implies a specific context or setting, which is Hudson Bay, and refers to a particu... |
| 2 | How does direct recycling differ from conventional battery recycling methods? | Conventional recycling chemically breaks down battery components into raw materials, while direc... | 5.0 | The passage provides sufficient information to answer the question unambiguously, as it describe... | 4 | The question is straightforward and addresses a specific comparison between two methods, which i... | 5.0 | The question seems clear and asks for a direct comparison between two methods without referencin... |
| 3 | What did the NREL study find about the relationship between electric vehicle (EV) owners and sol... | The study found that EV owners are more likely to install solar panels in their homes, with 25% ... | 5.0 | The question is clearly and unambiguously answerable with the given context, as the text provide... | 3 | The question is relatively specific and asks about a specific study, which is a good starting po... | 4.0 | The question appears to be self-explanatory and does not require any external context to be unde... |
| 4 | Why might EV owners be more inclined to adopt solar technology? | According to Shivam Sharda, the lead author of the study, installing solar panels can offset the... | 5.0 | The context provides several reasons why EV owners might be more inclined to adopt solar technol... | 3 | This question touches on a relevant and timely topic in the context of NLP and machine learning,... | 4.0 | The question is clear and informational, but it lacks specific context and relies on the reader ... |
============================================ Final evaluation dataset (showing first 5 rows):
| question | answer | groundedness_score | relevance_score | standalone_score | |
|---|---|---|---|---|---|
| 0 | ** What is green hydrogen, and how is it produced? \n** | ** Green hydrogen is a sustainable energy carrier produced through water electrolysis powered by... | 5.0 | 3 | 4.0 |
| 1 | Why are polar bears in Hudson Bay particularly vulnerable to longer ice-free periods? | Polar bears rely on sea ice to hunt seals, their primary food source. Longer ice-free periods fo... | 5.0 | 4 | 5.0 |
| 2 | How does direct recycling differ from conventional battery recycling methods? | Conventional recycling chemically breaks down battery components into raw materials, while direc... | 5.0 | 4 | 5.0 |
| 3 | What did the NREL study find about the relationship between electric vehicle (EV) owners and sol... | The study found that EV owners are more likely to install solar panels in their homes, with 25% ... | 5.0 | 3 | 4.0 |
| 4 | Why might EV owners be more inclined to adopt solar technology? | According to Shivam Sharda, the lead author of the study, installing solar panels can offset the... | 5.0 | 3 | 4.0 |
print(f"We have {len(generated_questions_cleaned_media_deepseek)} quality QA pairs")
generated_questions_cleaned_media_deepseek['question'].head()
generated_questions_cleaned_media_deepseek.to_csv("data/generated_questions_cleaned_media_deepseek.csv")
For the Patent Dataset¶
Evaluate the generated QA pairs for the patent QA pairs generated with DeepSeek R1.
generated_questions_cleaned_patent_deepseek = pd.read_csv("datasets/generated_questions_cleaned_patent_deepseek.csv")
generated_questions_cleaned_patent_deepseek = create_eval(output_patent_cleaned_deepseek,"patent")
generated_questions_cleaned_patent_deepseek = eval_qa_pairs(generated_questions_cleaned_patent_deepseek )
print(f"We have {len(generated_questions_cleaned_patent_deepseek)} quality QA pairs")
generated_questions_cleaned_patent_deepseek['question'].head()
generated_questions_cleaned_patent_deepseek.to_csv("data/generated_questions_cleaned_patent_deepseek.csv")
7. Plotting Evaluation Score Distributions¶
This helper plots the frequency of a given evaluation score (groundedness, relevance, or standalone) so you can quickly see how your QA pairs fare.
By visualizing each criterion’s distribution, you can identify if most QA pairs cluster at high scores or if there’s a long tail of low-quality examples that may need further filtering.
def plot_evaluation_distribution(df, score_field, title):
plt.figure(figsize=(8, 4))
sns.histplot(df[score_field], bins=5, kde=False)
plt.title(title)
plt.xlabel("Score")
plt.ylabel("Frequency")
    plt.xticks(range(1, 6))  # scores range from 1 to 5
plt.show()
plot_evaluation_distribution(generated_questions_cleaned_media_llama, "groundedness_score", "Distribution of Groundedness Scores")
plot_evaluation_distribution(generated_questions_cleaned_media_llama, "relevance_score", "Distribution of Relevance Scores")
plot_evaluation_distribution(generated_questions_cleaned_media_llama, "standalone_score", "Distribution of Standalone Scores")
plot_evaluation_distribution(generated_questions_cleaned_patent_llama, "groundedness_score", "Distribution of Groundedness Scores")
plot_evaluation_distribution(generated_questions_cleaned_patent_llama, "relevance_score", "Distribution of Relevance Scores")
plot_evaluation_distribution(generated_questions_cleaned_patent_llama, "standalone_score", "Distribution of Standalone Scores")
Analysis of the Human Evaluation Dataset¶
In addition to our generated QA pairs, we evaluate the human evaluation dataset (cleantech_rag_evaluation_data_2024-09-20.csv). This dataset is a small, high-quality evaluation set designed to benchmark the retrieval step in the RAG system. The human evaluators have already curated these 28 examples for quality.
We pass this dataset through our classification and evaluation pipeline to see how well our automated QA evaluation compares. After running the examples through the pipeline (which includes our LLM based critique agents and filtering logic), we found that only 4 out of the 28 examples were classified as good.
This outcome highlights a few key insights:
High Standards for Quality: Our pipeline applies strict evaluation metrics for groundedness, relevance, and standalone clarity. That only 4 examples met our threshold indicates these metrics are effective at separating high-quality QA pairs from those that are ambiguous, irrelevant, or overly context-dependent.
Differences in Human vs. Automated Judgment: Although the human evaluation dataset is curated for quality, our automated critique agents may apply slightly different standards. This gap can arise from the inherent differences between human judgment and the rigid, criterion-driven nature of our LLM-based scoring.
Potential Areas for Improvement: The low pass rate (4/28) suggests that even high-quality, human-curated examples can fall short of our strict automatic evaluation criteria, possibly due to subtle issues in the questions or context that our models pick up on. It also indicates that our QA generation and evaluation techniques could be further fine-tuned to align with human judgment.
Implications for the Retrieval Step: Since this dataset is specifically for assessing the retrieval component of our RAG system, maintaining a high-quality benchmark is crucial. Even though our automated pipeline filters aggressively, ensuring that the retrieval component consistently returns top-quality examples is a key factor in overall system performance.
In summary, processing the human evaluation dataset demonstrates the robustness of our filtering and quality-control process, while also revealing where our automated system differs from human expectations. This insight is valuable for tuning both our QA generation methods and our evaluation criteria toward an optimal RAG system.
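The threshold described earlier (keep only QA pairs scoring at least 3 out of 5 on every criterion) can be sketched as a simple pandas mask. This is a minimal illustration, not the notebook's actual `eval_qa_pairs` implementation; the column names follow the evaluation tables shown in this notebook.

```python
import pandas as pd

# Minimal sketch of the filtering logic: keep only QA pairs that score
# at least 3 out of 5 on every criterion. The real eval_qa_pairs helper
# may differ in detail.
def filter_qa_pairs(df: pd.DataFrame, threshold: int = 3) -> pd.DataFrame:
    criteria = ["groundedness_score", "relevance_score", "standalone_score"]
    mask = (df[criteria] >= threshold).all(axis=1)
    return df[mask]

demo = pd.DataFrame({
    "question": ["q1", "q2", "q3"],
    "groundedness_score": [5, 3, 4],
    "relevance_score": [2, 4, 3],
    "standalone_score": [5, 5, 4],
})
kept = filter_qa_pairs(demo)
print(len(kept))  # q1 fails the relevance criterion, so 2 rows survive
```

Because the mask requires every criterion to clear the bar, a single weak score (like q1's relevance of 2) is enough to discard a pair, which is exactly why the pass rate on the human set is so low.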
human_eval.head(3)[['question','relevant_text','answer','article_url']]
| question | relevant_text | answer | article_url | |
|---|---|---|---|---|
| 0 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion)... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents ... | https://www.sgvoice.net/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU's Green Deal Industrial Plan aims to enhance the competitiveness of its net zero indust... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 2 | What is the EU's Green Deal Industrial Plan? | The European counterpart to the US Inflation Reduction Act (IRA) aims to create an environment t... | The EU's Green Deal Industrial Plan aims to enhance the competitiveness of its net zero indust... | https://www.pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan/ |
dfs = [human_eval]
human_eval_categorized_sentence_bert = stack_and_categorize(dfs)
human_eval_categorized_sentence_bert.head(3)[['question','relevant_text','answer','article_url']]
| question | relevant_text | answer | article_url | |
|---|---|---|---|---|
| 0 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché said it has developed an environmentally friendly way to produce lithium-ion (Li-ion)... | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents ... | https://www.sgvoice.net/strategy/technology/23971/leclanches-new-disruptive-battery-boosts-energ... |
| 1 | What is the EU's Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the EU to make its net zero industry more competitive... | The EU's Green Deal Industrial Plan aims to enhance the competitiveness of its net zero indust... | https://www.sgvoice.net/policy/25396/eu-seeks-competitive-boost-with-green-deal-industrial-plan/ |
| 2 | What is the EU's Green Deal Industrial Plan? | The European counterpart to the US Inflation Reduction Act (IRA) aims to create an environment t... | The EU's Green Deal Industrial Plan aims to enhance the competitiveness of its net zero indust... | https://www.pv-magazine.com/2023/02/02/european-commission-introduces-green-deal-industrial-plan/ |
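As described in the overview, `stack_and_categorize` assigns each question to a category (e.g., Factual, Comparative) by semantic similarity to predefined prototypes. A minimal sketch of that idea is shown below; here `prototypes` holds toy 2-dimensional vectors, standing in for real SentenceBERT embeddings of prototype questions (the actual helper and its categories may differ).

```python
import numpy as np

# Hypothetical sketch of prototype-based categorization: assign each
# question to the category whose prototype embedding is most similar
# by cosine similarity. The vectors here are toy stand-ins for real
# sentence embeddings (e.g. from a SentenceBERT encoder).
def categorize(question_vec: np.ndarray, prototypes: dict) -> str:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(prototypes, key=lambda name: cos(question_vec, prototypes[name]))

prototypes = {
    "Factual": np.array([1.0, 0.1]),
    "Comparative": np.array([0.1, 1.0]),
}
print(categorize(np.array([0.9, 0.2]), prototypes))  # closest to the Factual prototype
```

In the real pipeline the prototype vectors would be embeddings of exemplar questions per category, so the same nearest-prototype rule scales to any number of categories.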
outputs_human_eval = generate_output_list(human_eval_categorized_sentence_bert,name= 'human_eval')
# Build the evaluation frame from the generated outputs and cache it to disk.
results_human_eval = create_eval(outputs_human_eval)
results_human_eval.to_csv("datasets/results_human_eval.csv")
evaluated_results = eval_qa_pairs(results_human_eval)
Evaluation dataset before filtering (showing first 5 rows):
| question | answer | groundedness_score | groundedness_eval | relevance_score | relevance_eval | standalone_score | standalone_eval | |
|---|---|---|---|---|---|---|---|---|
| 0 | What is the innovation behind Leclanché's new method to produce lithium-ion batteries? | Leclanché's innovation is using a water-based process instead of highly toxic organic solvents ... | 3 | The evaluation does not provide a clear understanding of the innovation, as the context only men... | 1 | The question asks about a specific innovation related to lithium-ion batteries, which is a techn... | 1 | The question implies that the reader is familiar with Leclanché, a company, and its products, sp... |
| 1 | What is the EU's Green Deal Industrial Plan? | The EU's Green Deal Industrial Plan aims to enhance the competitiveness of its net zero indust... | 5 | The context provides sufficient information about the purpose and objectives of the EU's Green D... | 2 | The given question appears to be a factual query about a specific policy initiative by the Europ... | 1 | The question explicitly mentions "the EU's" which indicates a reference to the European Union an... |
| 2 | What is the EU's Green Deal Industrial Plan? | The EU's Green Deal Industrial Plan aims to enhance the competitiveness of its net zero indust... | 4 | Based on the given context, it is clear that the question is related to the EU's environmental a... | 2 | The question seems to be relatively vague and does not directly relate to NLP or the Hugging Fac... | 4 | The question refers to the European Union's policy, specifically mentioning "EU", a context-depe... |
| 3 | What are the four focus areas of the EU's Green Deal Industrial Plan? | The four focus areas of the EU's Green Deal Industrial Plan are the regulatory environment, acce... | 5 | The context provides a clear and concise description of the four focus areas of the EU's Green D... | 2 | This question appears to be brief and informative, covering a specific topic related to a well-k... | 5 | The question uses a specific and well-known term ("EU's Green Deal Industrial Plan") that refere... |
| 4 | When did the cooperation between GM and Honda on fuel cell vehicles start? | July 2013 | 5 | The question about when the cooperation between GM and Honda on fuel cell vehicles started can b... | 2 | The question about the cooperation between GM and Honda on fuel cell vehicles is a relatively sp... | 1 | The question implicitly refers to a well-known partnership between the companies, but it is not ... |
============================================ Final evaluation dataset (showing first 5 rows):
| question | answer | groundedness_score | relevance_score | standalone_score | |
|---|---|---|---|---|---|
| 11 | Why are milder winters detrimental for grapes and apples? | Milder winters are detrimental for grapes and apples because these perennial species rely on con... | 4 | 3 | 4 |
| 17 | What is Agrivoltaics? | the integration of food production and solar energy to make better use of limited land and soil ... | 4 | 4 | 5 |
| 18 | What is Agrivoltaics? | the integration of food production and solar energy to make better use of limited land and soil ... | 5 | 4 | 5 |
| 20 | What are the obstacles for cannabis producers when it comes to using solar energy? | limited access to financial instruments, inability to secure standard loans or mortgages, lack o... | 4 | 3 | 5 |
len(evaluated_results)
4
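The per-criterion scores and rationales in the tables above come from LLM critique agents that return a free-text evaluation plus a 1–5 rating. The prompt wording below is hypothetical (the actual prompts used by `create_eval` are not shown in this notebook), but it illustrates the response format and how a numeric score can be parsed from it.

```python
# Hypothetical sketch of a groundedness critique prompt; the actual
# prompt used by the notebook's critique agents is not shown here.
GROUNDEDNESS_PROMPT = """\
You will be given a context and a question.
Rate how well the question can be answered from the context alone,
on a scale of 1 (not answerable) to 5 (clearly answerable).
Answer in the format:
Evaluation: (your rationale)
Total rating: (a number between 1 and 5)

Context: {context}
Question: {question}
"""

def parse_rating(llm_output: str) -> int:
    # Extract the numeric score from the agent's formatted answer.
    line = [l for l in llm_output.splitlines() if l.startswith("Total rating:")][0]
    return int(line.split(":")[1].strip())

demo_output = "Evaluation: The context fully answers it.\nTotal rating: 5"
print(parse_rating(demo_output))  # 5
```

Requiring a fixed "Total rating:" line makes the score machine-parseable while still keeping the rationale text for manual inspection.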
plot_evaluation_distribution(evaluated_results, "groundedness_score", "Distribution of Groundedness Scores")
plot_evaluation_distribution(evaluated_results, "relevance_score", "Distribution of Relevance Scores")
plot_evaluation_distribution(evaluated_results, "standalone_score", "Distribution of Standalone Scores")
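A score-distribution plot like the ones produced above can be sketched as a bar chart of how often each 1–5 score occurs. This is a minimal stand-in for the notebook's `plot_evaluation_distribution` helper, which may style its output differently.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Minimal sketch of a score-distribution plot: count how often each
# score 1-5 occurs and draw the counts as a bar chart.
def plot_evaluation_distribution(df, column, title):
    counts = df[column].value_counts().reindex(range(1, 6), fill_value=0)
    counts.plot(kind="bar")
    plt.title(title)
    plt.xlabel("score")
    plt.ylabel("count")
    plt.tight_layout()
    return counts

demo = pd.DataFrame({"groundedness_score": [4, 4, 5, 4]})
counts = plot_evaluation_distribution(demo, "groundedness_score", "Demo")
print(counts.loc[4], counts.loc[5])  # 3 1
```

Reindexing onto the full 1–5 range keeps absent scores visible as zero-height bars, which makes distributions across different criteria directly comparable.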
8. Conclusion & Next Steps¶
In this notebook, we explored the process of creating and analyzing QA pairs from 50–100 carefully selected paragraphs from the Cleantech Media and Google Patent datasets. By generating, classifying, and rigorously evaluating these QA pairs using LLMs, we built a high-quality dataset that serves as a robust foundation for our RAG pipeline.
Next Steps:
Integrate the filtered QA dataset into the implemented RAG system.
Experiment with different LLM prompts and filtering thresholds to refine your QA pair quality.
Use the insights gained from the evaluation visualizations to further tune your retrieval and generation processes.